with suitable GPU code annotations. Implementing such a benchmark suite is left as future work. However, as mentioned in chapter 5, jit4GPU is only able to generate GPU code when the memory access pattern is describable by RCSLMADs. Therefore, some examples for which jit4GPU will be unable to generate a GPU version include FFT, conjugate gradient algorithms and matrix solvers with triangular loops (a sketch of such a triangular loop is given after the list below).

The important results from the experiments are:

1. Using a GPU delivered up to 100 times speedup over generated OpenMP code running on the CPU.

2. Loop optimizations performed by jit4GPU deliver up to four times performance improvement on the GPU.

3. Benchmarks that perform very little computation per data item accessed are not suitable for the GPU because the data transfer overhead is larger than the computation time in such cases.
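To illustrate why such patterns fall outside the RCSLMAD model, consider forward substitution on a lower-triangular system: the inner loop bound depends on the outer loop index, so the region of memory touched is triangular rather than rectangular. The following is a minimal sketch, not code from the thesis:

    import numpy as np

    def forward_substitution(L, b):
        # Solve L x = b for a lower-triangular matrix L. The inner
        # loop runs over range(i), so the portion of L accessed grows
        # with i: a triangular access pattern that no rectangular
        # RCSLMAD can describe.
        n = L.shape[0]
        x = np.zeros(n)
        for i in range(n):
            s = b[i]
            for j in range(i):  # triangular loop bound
                s -= L[i, j] * x[j]
            x[i] = s / L[i, i]
        return x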
All experiments were done using an AMD Phenom X4 9550 (2.2 GHz quad-core) paired with a Radeon HD 4870 and 4 GB of RAM. Frequency scaling was disabled on the CPU. The operating system was Ubuntu 8.10 with Linux kernel version 2.6.27-7 and GCC version 4.3.2. C++ code was compiled with the optimization flag -O3; when compiling for OpenMP, the flag -fopenmp was also passed. Several other flags were tested but produced no notable performance changes, and they are excluded from the results presented here.

Each experiment was repeated 5 times, and the minimum, maximum and mean execution times are presented.

7.1 Matrix multiplication

Matrix multiplication was implemented for 32-bit and 64-bit floating-point matrices. A very simple implementation of matrix multiplication was written in Python and the outer two loops were marked as parallel loops for GPU execution. Performance was studied across a range of matrix sizes. To compare the performance of the generated GPU code against the CPU, the performance results of the generated OpenMP code as well as performance results from the ATLAS library are included. ATLAS implements a tiled matrix multiplication algorithm and autotunes itself to best fit the system CPU at the time of installation. ATLAS is a high-performance library with many years of development effort. By comparison, the Python implementation was a straightforward implementation of matrix multiplication written in under ten lines of Python code. From this Python source, the compiler was able to generate GPU code.
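For reference, the straightforward implementation described above would look something like the following minimal sketch; the inline comments stand in for the compiler's actual parallel-loop annotations, whose exact syntax is not reproduced here, and the function name is illustrative:

    import numpy as np

    def matmul(A, B, C):
        # In the annotated version, the outer two loops are the ones
        # marked as parallel for GPU execution; the innermost loop is
        # a sequential reduction.
        m, k = A.shape
        n = B.shape[1]
        for i in range(m):      # parallel loop
            for j in range(n):  # parallel loop
                s = 0.0
                for p in range(k):
                    s += A[i, p] * B[p, j]
                C[i, j] = s

    C = np.zeros((64, 64))
    matmul(np.random.rand(64, 32), np.random.rand(32, 64), C)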
