Chapter 7

Experimental evaluation

This chapter presents the performance of the code generated for OpenMP and for the GPU on several highly parallel kernel benchmarks. Performance was evaluated against the generated serial C++ code: for each kernel benchmark, the execution time of the generated serial code was compared against the total execution time of the generated OpenMP and GPU versions. Performance on the GPU was measured with and without loop optimizations. For GPU performance, four numbers are reported:

1. GPU total execution time with GPU-specific loop optimizations enabled in jit4GPU. The reported time includes data-transfer and JIT-compilation overheads.

2. GPU execution time only, with loop optimizations enabled. Only the time taken by the GPU to execute the GPU binary is included; data-transfer and JIT-compilation times are excluded.

3. GPU total execution time without any GPU-specific loop optimizations.

4. GPU execution time only, without any GPU-specific loop optimizations enabled.

The kernels chosen for performance evaluation were matrix multiplication, the CP benchmark from the Parboil benchmark suite, Black-Scholes option pricing, a 5-point stencil code, and the RPES kernel, also from the Parboil benchmark suite. The objective of the performance evaluation is to study the performance gains when GPU code generation is enabled, relative to the performance of an OpenMP version of each benchmark. Therefore, all benchmarks were chosen to be highly parallel kernels suitable for execution on the GPU. Among the chosen benchmarks, the memory access pattern in four of the benchmarks is describable by RCSLMADs, and the compiler is able to generate GPU code for them. In the RPES benchmark, the loops are triangular with indirect memory references, and the compiler was unable to generate GPU code.

A more comprehensive study of the percentage of cases in which the compiler is able to generate GPU code would require a standard benchmark suite implemented in Python.
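As a rough illustration of how the four reported numbers relate, the following sketch separates the phases of a jitted GPU run. The phase functions are hypothetical stubs standing in for jit4GPU's actual pipeline, which the text does not specify:

    import time

    # Hypothetical stand-ins for jit4GPU's phases, stubbed so the sketch runs.
    def jit_compile(kernel): return kernel
    def transfer_to_gpu(data): pass
    def run_on_gpu(binary): pass
    def transfer_from_gpu(data): pass

    def measure(kernel, data):
        t0 = time.perf_counter()
        binary = jit_compile(kernel)   # JIT compilation overhead
        transfer_to_gpu(data)          # host-to-device data transfer
        t1 = time.perf_counter()
        run_on_gpu(binary)             # GPU executes the generated binary
        t2 = time.perf_counter()
        transfer_from_gpu(data)        # device-to-host data transfer
        t3 = time.perf_counter()
        return {"gpu_total": t3 - t0,  # numbers 1 and 3: transfers and JIT included
                "gpu_only": t2 - t1}   # numbers 2 and 4: GPU binary time only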
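To make the distinction between the two kinds of kernels concrete, the sketch below contrasts an affine, rectangular loop nest (the 5-point stencil, whose accesses fit the RCSLMAD form) with a triangular loop containing an indirect memory reference (the RPES-style pattern the compiler rejects). The function bodies are illustrative, not the benchmark sources:

    import numpy as np

    def stencil_5pt(a, b):
        # Rectangular loop nest; every subscript is an affine function of
        # the loop indices, so the access pattern fits the RCSLMAD form.
        for i in range(1, a.shape[0] - 1):
            for j in range(1, a.shape[1] - 1):
                b[i, j] = 0.2 * (a[i, j] + a[i - 1, j] + a[i + 1, j]
                                 + a[i, j - 1] + a[i, j + 1])

    def rpes_like(vals, idx, out):
        # Triangular bounds and a subscript-of-subscript reference; this
        # access pattern cannot be captured by an RCSLMAD, so no GPU code.
        for i in range(len(out)):
            for j in range(i + 1):         # triangular loop
                out[i] += vals[idx[j]]     # indirect memory reference

    a = np.random.rand(64, 64)
    b = np.zeros_like(a)
    stencil_5pt(a, b)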
