11.07.2015 Views

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

approach, the number <str<strong>on</strong>g>of</str<strong>on</strong>g> elements transferred is 6 ∗ 5 ∗ 2 = 60 elements. The ERL approachcan result in significant savings in the number <str<strong>on</strong>g>of</str<strong>on</strong>g> elements transferred if <strong>on</strong>ly a small tile <str<strong>on</strong>g>of</str<strong>on</strong>g>a large array is accessed in a computati<strong>on</strong>.If the number <str<strong>on</strong>g>of</str<strong>on</strong>g> dimensi<strong>on</strong>s is greater than <strong>on</strong>e, then a feasible soluti<strong>on</strong> does not necessarilyexist. The problem is that we are attempting to interpret the n RCSLMADs as ninterleaved accesses into a d-dimensi<strong>on</strong>al array but such an interpretati<strong>on</strong> does not alwaysexist. If the integer programming solver is called and no feasible soluti<strong>on</strong> is found, then thecompiler falls back to algorithm 3 <str<strong>on</strong>g>for</str<strong>on</strong>g> data transfer analysis.To generate the mapped address <str<strong>on</strong>g>of</str<strong>on</strong>g> each array access, algorithm 5 is used.Algorithm 5 computes the mapped address <str<strong>on</strong>g>for</str<strong>on</strong>g> multidimensi<strong>on</strong>al RCSLMAD problemssolved using the integer linear programming methodInputs: Specified c<strong>on</strong>stants u k , values t mk and b ′ m computed using integer linearprogramming Outputs: A list <str<strong>on</strong>g>of</str<strong>on</strong>g> n expressi<strong>on</strong>s representing the address <strong>on</strong> theGPU.1: C<strong>on</strong>struct a vector R1 = [b ′ 1, b ′ 2, ..., b ′ n].2: C<strong>on</strong>struct a vector R2 = R1.3: Remove all n<strong>on</strong>-duplicate entries <str<strong>on</strong>g>of</str<strong>on</strong>g> R2.4: Sort R2 in-place.5: C<strong>on</strong>struct a vector R3 such that R3[i] = x where R2[x] = R1[i].6: Compute q = length <str<strong>on</strong>g>of</str<strong>on</strong>g> R2.7: <str<strong>on</strong>g>for</str<strong>on</strong>g> each integer k such that 1 ≤ k ≤ d do8: Compute t k = max(t 1k , t 2k , ..., t nk ).9: end <str<strong>on</strong>g>for</str<strong>on</strong>g>10: Compute b 0 = min(R1).11: <str<strong>on</strong>g>for</str<strong>on</strong>g> each RCSLMAD L m dod∑d∏12: C<strong>on</strong>struct an expressi<strong>on</strong> I m = (b ′ m − b 0 ) + q ∗ (i k + t mk ) ∗ (u l + t l ).13: end <str<strong>on</strong>g>for</str<strong>on</strong>g>14: return list {I 1 , I 2 , ..., I n }.k=1l=k+15.4 Loop tiling <str<strong>on</strong>g>for</str<strong>on</strong>g> handling large loopsUsing previously discussed algorithms 3 and 5, the number <str<strong>on</strong>g>of</str<strong>on</strong>g> elements to be transferred tothe GPU can be computed. However, if the amount <str<strong>on</strong>g>of</str<strong>on</strong>g> memory to be transferred does not fit<strong>on</strong> the GPU, then the compiler must attempt to tile the loop. In the implemented compiler,the compiler <strong>on</strong>ly c<strong>on</strong>siders tiling the parallel loops since tiling parallel loops is always legal.Once an acceptable tile size is found, then the tiles are executed sequentially <strong>on</strong> the GPU.The executi<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> the tiles can be pipelined but this possibility was not c<strong>on</strong>sidered due totime c<strong>on</strong>straints.In the ideal case, the tiling should be such that the total amount <str<strong>on</strong>g>of</str<strong>on</strong>g> data transferredbetween the CPU and GPU should be minimized. However, again the problem is hard andthere<str<strong>on</strong>g>for</str<strong>on</strong>g>e a very basic heuristic was used. The compiler simply reduces the upper bound <str<strong>on</strong>g>of</str<strong>on</strong>g>47

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!