the element and its four neighbours, and writes the result to the corresponding element of a matrix of the same dimensions. The kernel is highly parallel because each element is processed independently, but the amount of computation per point is very small (an illustrative sketch of such a kernel is given at the end of this section). The compiler successfully carried out the array access analysis and generated GPU code for this benchmark. However, the data transfer overhead considerably outweighed the computation time savings of the GPU, so the benchmark ran slower when using the GPU than with OpenMP. The results are reported in Table 7.9. For each case, the minimum, maximum and mean execution times are presented.

Table 7.9: Execution time for 5-point stencil benchmark (milliseconds)

Problem size                  1024     2048     3072     4096
Serial Time          min      10.8     43.3     97.5     175
                     max      10.9     43.7     98.2     176.6
                     mean     10.8     43.4     97.7     175.8
OpenMP 1 thread      min      10.1     46       70       150
                     max      10.26    47       71.24    152
                     mean     10.17    46.3     70.4     151
OpenMP 4 threads     min      5        24       47.3     58.1
                     max      5        24       47.8     58.5
                     mean     5        24       47.5     58.2
GPU Total (Opt)      min      65       98.8     123.1    1010
                     max      65.8     99.6     124.9    1016
                     mean     65.3     99.1     123.7    1014
GPU Only (Opt)       min      0.39     0.89     2.8      25
                     max      0.39     0.89     2.8      26
                     mean     0.39     0.89     2.8      25.5
GPU Total (No Opt)   min      35.3     63.1     112.1    1000.3
                     max      35.6     63.7     112.7    1006
                     mean     35.4     63.25    112.3    1002
GPU Only (No Opt)    min      0.59     2.0      4.2      50
                     max      0.59     2.0      4.2      52
                     mean     0.59     2.0      4.2      51

7.5 RPES benchmark

The RPES benchmark is a Python adaptation of the benchmark from the Parboil suite. The benchmark involves indirect memory loads and triangular loops. unPython determined that a GPU version could not be generated and therefore did not emit calls to the JIT compiler. Consequently, the performance did not change when the GPU was enabled, because the GPU was never used and the JIT compiler was never called. The compiler only generated an OpenMP version of the benchmark. This benchmark illustrates that a programmer can safely add GPU parallel annotations without fear of errors when compiler limitations prevent GPU code generation.

The benchmark was only tested with the default parameters provided by the benchmark
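The sketch referenced in the stencil discussion above is given here. It is a minimal Python/NumPy illustration of the kind of 5-point stencil kernel being benchmarked; the function name, the choice of a simple five-point average, and the boundary handling are assumptions made for illustration and are not taken from the benchmark source. The problem size in the usage example matches the smallest size in Table 7.9.

import numpy as np

def stencil_5pt(a):
    # Each interior output element is computed from the corresponding input
    # element and its four neighbours. The simple average used here is an
    # assumption, not necessarily the benchmark's exact formula.
    out = np.zeros_like(a)
    out[1:-1, 1:-1] = (a[1:-1, 1:-1] +
                       a[:-2, 1:-1] + a[2:, 1:-1] +
                       a[1:-1, :-2] + a[1:-1, 2:]) / 5.0
    return out

# One sweep over a 1024x1024 matrix (the smallest problem size in Table 7.9).
a = np.random.rand(1024, 1024)
b = stencil_5pt(a)

Because each output element depends only on the input matrix, every element can be computed independently, which is why such a kernel is easy to parallelize with either OpenMP or GPU code generation, while still performing very little arithmetic per point.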
