11.07.2015

A Compiler for Parallel Execution of Numerical Python Programs on ...

GPU code that performed twice as fast as ATLAS and over 100 times faster than generated OpenMP code.

The execution times for matrix multiplication are presented in Table 7.1 for 32-bit floating point and in Table 7.3 for 64-bit floating point. Each case was repeated 5 times, and the minimum, maximum and mean execution times are presented. The speedups obtained using the GPU over ATLAS are presented in Table 7.2 and Table 7.4. The speedups are calculated using the minimum execution time, that is, the ATLAS minimum time divided by the corresponding GPU Total minimum time.

Table 7.1: Execution time for matrix multiplication benchmark for 32-bit floating point (seconds)

Problem size                 1024     2048     4096      6120
OpenMP 4 threads     min     7.64    68.35   874.42    1213.7
                     max     7.82    68.83   877.98    1219.4
                     mean    7.75    68.67   875.64   1215.14
ATLAS BLAS           min    0.125    0.684     5.12     17.87
                     max    0.126    0.687     5.14     18.05
                     mean   0.125    0.685     5.13     17.91
GPU Total (Opt)      min    0.084     0.39     3.00      8.19
                     max    0.087     0.41     3.00      8.34
                     mean   0.085     0.40     3.00      8.27
GPU Only (Opt)       min    0.025    0.277     1.78      6.86
                     max    0.027    0.288     1.88      7.01
                     mean   0.026    0.282     1.82      6.93
GPU Total (No opt)   min    0.159     0.99     8.68     28.16
                     max    0.161     1.04     8.84      28.8
                     mean   0.160     1.01     8.73     28.39
GPU Only (No opt)    min    0.108     0.90     7.46     26.88
                     max    0.110     0.95     7.61     27.52
                     mean   0.109    0.916      7.5     27.11

Table 7.2: Speedups for matrix multiplication using GPU for 32-bit floating point over ATLAS

Problem size   Speedup (No opt)   Speedup (Opt)
1024                0.786             1.488
2048                0.69              1.75
4096                0.58              1.7
6120                0.634             2.18

7.2 CP benchmark

The CP benchmark from the Parboil suite was implemented in Python. The benchmark is a simple nested loop, and the top two loops are annotated as parallel for GPU execution. The CP benchmark is representative of some computations done in molecular dynamics. It computes the Coulombic potential at each point in a planar grid. The computation of the potential at each grid point is independent of the other grid points, and therefore
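The structure of the CP computation can be illustrated with a minimal plain Python/NumPy sketch. Several details are assumptions made for illustration, not taken from the benchmark:

- the function name `coulombic_potential`;
- the `(x, y, z, charge)` row layout of the `atoms` array;
- the `spacing` and plane-height `z` parameters.

Physical constants such as Coulomb's constant are omitted. This section does not show the compiler's parallel-loop annotation syntax, so the two outer loops are only marked with comments.

```python
import math

import numpy as np


def coulombic_potential(atoms, nx, ny, spacing, z):
    """Hypothetical sketch of the CP kernel: potential on a planar grid.

    atoms   -- array of shape (n_atoms, 4); each row is (x, y, z, charge)
               (assumed layout, for illustration only)
    nx, ny  -- number of grid points along x and y
    spacing -- distance between neighbouring grid points
    z       -- height of the grid plane
    Physical constants (e.g. Coulomb's constant) are omitted.
    """
    energy = np.zeros((nx, ny))
    # Outer loops: every (i, j) grid point is computed independently of the
    # others. These are the loops the text describes as annotated parallel.
    for i in range(nx):
        for j in range(ny):
            x = i * spacing
            y = j * spacing
            total = 0.0
            # Inner loop: accumulate q / r from every atom into this point.
            for ax, ay, az, q in atoms:
                dx = x - ax
                dy = y - ay
                dz = z - az
                total += q / math.sqrt(dx * dx + dy * dy + dz * dz)
            energy[i, j] = total
    return energy


if __name__ == "__main__":
    # Two atoms above the z = 0 plane, so no distance is zero.
    atoms = np.array([[0.0, 0.0, 1.0, 1.0], [2.0, 1.0, 1.0, -0.5]])
    print(coulombic_potential(atoms, nx=3, ny=2, spacing=1.0, z=0.0))
```

Each `energy[i, j]` is written by exactly one `(i, j)` iteration and only reads `atoms`, so the two outer loops carry no dependences between iterations. That independence is what lets them be spread across GPU threads. In this sketch the inner loop over atoms remains a sequential sum within each grid point. Choosing a plane height that no atom lies on avoids a division by zero.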
