A Compiler for Parallel Execution of Numerical Python Programs on ...

with AMD architecture. Their loop optimizations are completely different from the loop optimizations described in this thesis because the architectures are very different.

Another work that extends OpenMP for GPGPU programming is the EXOCHI framework by Wang et al. [23]. EXOCHI is an extension of OpenMP for C/C++ for heterogeneous systems. Their implementation is for a multicore x86 CPU and for an integrated Intel graphics chipset. Unlike the discrete GPUs considered in this thesis, such as the Radeon 4870, Intel graphics chipsets are integrated into the northbridge of the CPU and do not sit on a PCIe bus. These integrated chips also do not have separate onboard memory and can access the system RAM. EXOCHI therefore does not need to copy data and instead only needs to remap the memory address translation table from the CPU to the GPU. The address translation remapping is handled by EXOCHI's runtime. To program the GPU, EXOCHI requires the programmer to write GPU code but does not require the programmer to do any data transfers because data transfers are not necessary. Instead, the GPU code can directly access any data in system RAM, thereby simplifying the programming. EXOCHI is only suitable for systems where both the CPU and the accelerator (such as the GPU) can access the system RAM directly and where the address translation table can be simply remapped. Therefore EXOCHI is not applicable to current generation discrete GPUs.

8.4 Conclusions

This thesis describes the first Python compiler to provide simple parallel programming support for numerical applications. The implemented compiler is also one of the first to automatically map a shared-memory parallel programming model to a GPGPU system. This thesis describes a new algorithm to automatically transfer relevant data between a CPU and a GPU. The implemented compiler provides a programming model that is simpler to program than current GPGPU APIs such as CUDA and that relies on compiler analysis and optimization to automatically generate GPU code.

Chapter 9
Conclusions

This thesis introduced a new programming model for more efficient programming of numerical programs in Python for execution on GPUs. The thesis also described the design and implementation of a compiler system that compiles numerical Python programs annotated with type and parallel loop annotations for multicores and GPUs. In this new programming model, a programmer writes code for a simple shared-memory abstraction and the compiler automatically converts the program to use a GPU as an accelerator. The program remains portable to multicores and GPUs with no code changes.

The compiler system consists of unPython, an ahead-of-time compiler, and jit4GPU, a just-in-time compiler. Jit4GPU implements a new algorithm to analyze the regions of memory accessed by an array reference in a loop nest. The algorithm is restricted to a class of affine accesses termed RCSLMADs. Jit4GPU automatically transfers the required data for the computation between the CPU and the GPU based on the results of the array access algorithms. Jit4GPU is not a general-purpose JIT compiler and only works on numerical programs represented as parallel loop nests with array accesses representable as RCSLMADs. Jit4GPU generates GPU code from a typed abstract-syntax-tree (AST) representation of the Python program generated by unPython. Jit4GPU also performs several loop optimizations such as loop unrolling and memory load coalescing.

The performance evaluation used several numerical kernels. On some kernels, code generated by Jit4GPU runs over 100 times faster than OpenMP code generated by unPython. Jit4GPU also delivers better performance than some highly tuned CPU libraries, such as ATLAS, without requiring the programmer to do any optimizations such as unrolling or tiling in the original Python source. Compilers such as Jit4GPU allow the programmer to easily utilize the computational power of modern GPUs for general-purpose computation.
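The idea of analyzing which region of memory an affine array reference touches in a loop nest can be illustrated with a small sketch. The code below is not Jit4GPU's RCSLMAD algorithm; it is a simplified bounding computation, assuming a flat (one-dimensional) view of the array, loops that each run from 0 to a fixed trip count, and an access offset of the form base + sum(stride_k * i_k). The names `lmad_extent` and `touched_offsets` are hypothetical and chosen only for this illustration.

```python
from itertools import product


def lmad_extent(base, strides, trip_counts):
    """Return (lowest, highest) flat offsets touched by an affine access.

    The access is offset = base + sum(strides[k] * i_k),
    with 0 <= i_k < trip_counts[k] for every loop k.
    This is an illustrative bound, not the thesis's RCSLMAD analysis.
    """
    if len(strides) != len(trip_counts):
        raise ValueError("one stride is required per loop")
    if any(n <= 0 for n in trip_counts):
        raise ValueError("every loop must run at least once")
    lo = hi = base
    for stride, count in zip(strides, trip_counts):
        span = stride * (count - 1)  # offset change over the whole loop
        if span < 0:
            lo += span  # a negative stride extends the region downward
        else:
            hi += span  # a positive stride extends the region upward
    return lo, hi


def touched_offsets(base, strides, trip_counts):
    """Enumerate every touched offset by brute force (for checking only)."""
    ranges = [range(n) for n in trip_counts]
    return {
        base + sum(s * i for s, i in zip(strides, idx))
        for idx in product(*ranges)
    }


# Hypothetical example: A[i + 1][2 * j] in a row-major array with rows of
# length 10, for 0 <= i < 4 and 0 <= j < 5.
lo, hi = lmad_extent(base=10, strides=(10, 2), trip_counts=(4, 5))
print(lo, hi)
```

A runtime that knows such a bound only needs to copy the elements between the lowest and highest offsets to the accelerator instead of the whole array. The bound can overestimate when the access is strided, which is one reason a more precise representation of the accessed region is useful.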
