with suitable GPU code annotations. Implementing such a benchmark suite is left as future work. However, as mentioned in chapter 5, jit4GPU is only able to generate GPU code when the memory access pattern is describable by RCSLMADs. Therefore, some examples for which jit4GPU will be unable to generate a GPU version include FFT, conjugate gradient algorithms and matrix solvers with triangular loops (a sketch of such a triangular loop is given after the list below).

The important results from the experiments are:

1. Using a GPU delivered up to 100 times speedup over generated OpenMP code running on the CPU.

2. Loop optimizations performed by jit4GPU deliver up to four times performance improvement on the GPU.

3. Benchmarks that perform very little computation per data item accessed are not suitable for the GPU because the data transfer overhead is larger than the computation time in such cases.
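To illustrate why such patterns fall outside the RCSLMAD model, consider forward substitution on a lower-triangular system: the inner loop bound depends on the outer loop index, so the region of memory touched is triangular rather than rectangular. The following is a minimal sketch, not code from the thesis:

    import numpy as np

    def forward_substitution(L, b):
        # Solve L x = b for a lower-triangular matrix L. The inner
        # loop runs over range(i), so the portion of L accessed grows
        # with i: a triangular access pattern that no rectangular
        # RCSLMAD can describe.
        n = L.shape[0]
        x = np.zeros(n)
        for i in range(n):
            s = b[i]
            for j in range(i):  # triangular loop bound
                s -= L[i, j] * x[j]
            x[i] = s / L[i, i]
        return x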
All experiments were done using an AMD Phenom X4 9550 (2.2 GHz quad-core) paired with a Radeon HD 4870 and 4 GB of RAM. Frequency scaling was disabled on the CPU. The operating system was Ubuntu 8.10 with Linux kernel version 2.6.27-7 and GCC version 4.3.2. C++ code was compiled with the optimization flag -O3; when compiling for OpenMP, the flag -fopenmp was also passed. Several other flags were tested but produced no notable performance changes, and they are excluded from the results presented here.

Each experiment was repeated 5 times, and the minimum, maximum and mean execution times are presented.

7.1 Matrix multiplication

Matrix multiplication was implemented for 32-bit and 64-bit floating-point matrices. A very simple implementation of matrix multiplication was written in Python and the outer two loops were marked as parallel loops for GPU execution. Performance was studied across a range of matrix sizes. To compare the performance of the generated GPU code against the CPU, the performance results of the generated OpenMP code as well as performance results from the ATLAS library are included. ATLAS implements a tiled matrix multiplication algorithm and autotunes itself to best fit the system CPU at the time of installation. ATLAS is a high-performance library with many years of development effort. By comparison, the Python implementation was a straightforward implementation of matrix multiplication written in under ten lines of Python code. From this Python source, the compiler was able to generate GPU code.
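For reference, the straightforward implementation described above would look something like the following minimal sketch; the inline comments stand in for the compiler's actual parallel-loop annotations, whose exact syntax is not reproduced here, and the function name is illustrative:

    import numpy as np

    def matmul(A, B, C):
        # In the annotated version, the outer two loops are the ones
        # marked as parallel for GPU execution; the innermost loop is
        # a sequential reduction.
        m, k = A.shape
        n = B.shape[1]
        for i in range(m):      # parallel loop
            for j in range(n):  # parallel loop
                s = 0.0
                for p in range(k):
                    s += A[i, p] * B[p, j]
                C[i, j] = s

    C = np.zeros((64, 64))
    matmul(np.random.rand(64, 32), np.random.rand(32, 64), C)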
