11.07.2015 Views

A Compiler for Parallel Execution of Numerical Python Programs on ...


The algorithm is very conservative and transfers all memory locations in the memory interval spanned by the group. The algorithm is correct but transfers too much data and uses too much GPU on-board memory. Better solutions are needed in most cases.

Algorithm 3: Computes a superset of the union of arbitrary RCSLMADs
Inputs: A group $\{L_1, L_2, \ldots, L_n\}$ of RCSLMADs defined over domain $D$.
Outputs: Set $M$, size $S$ of set $M$, and function $F$ for mapping CPU memory locations to GPU memory locations.
1: Compute $m_1 = \min(\min_{V \in D} L_1(V), \min_{V \in D} L_2(V), \ldots, \min_{V \in D} L_n(V))$.
2: Compute $m_2 = \max(\max_{V \in D} L_1(V), \max_{V \in D} L_2(V), \ldots, \max_{V \in D} L_n(V))$.
3: Compute $M = \{m \mid m_1 \le m \le m_2, m \in \mathbb{Z}\}$.
4: Compute $S = m_2 - m_1$.
5: Construct $F(m) = m - m_1$ and $R = \{m \mid 0 \le m \le m_2 - m_1, m \in \mathbb{Z}\}$.
6: return $M$, $S$, and $F$.

While Algorithms 1, 2 and 3 have only been described for RCSLMADs, they are actually applicable to a slightly wider class of LMADs. For convenience, assume that the ordering function is the identity function and consider the condition $p_k > p_{k+1} - 1 + \sum_{j=k+1}^{d} p_j * (u_j - 1)$ imposed on RCSLMADs. If the condition is instead loosened to $p_k > \sum_{j=k+1}^{d} p_j * (u_j - 1)$, then Algorithms 1, 2 and 3 are still valid (this can be verified by substituting the loosened condition in the corresponding proofs).

5.3 More efficient solutions in specific cases

In some cases, it is possible to derive more efficient solutions that transfer less data than the general algorithm presented earlier. Computing and reasoning with unions of arbitrary RCSLMADs is non-trivial. Consider a group of $n$ RCSLMADs. Let $p_{jk}$ be the stride of the $j$-th RCSLMAD in the $k$-th dimension. Instead of attempting to compute the union of RCSLMADs in arbitrary cases, this section is limited to the case $p_{t_1 k} = p_{t_2 k}$ for all $1 \le t_1, t_2 \le n$, i.e. all RCSLMADs have the same stride in any given dimension $k$. Such cases occur when a programmer accesses multiple array locations in the loop body with a fixed distance between the array accesses. All the RCSLMADs must have the same ordering function because they share strides and are defined over the same domain.
Without loss of generality, assume that the ordering function is the identity throughout this section. One example of the type of problem studied in this section is as follows:

$L_1 = 0 + 20 * i + 3 * j$ (5.23)
$L_2 = 21 + 20 * i + 3 * j$ (5.24)
$0 \le i < 5$ (5.25)
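
To make the cost of the conservative bounding-interval approach of Algorithm 3 concrete, the following minimal Python sketch applies it to the two access functions above. It is an illustration under stated assumptions, not the compiler's implementation: the names `lmad_addresses`, `bounding_interval` and `satisfies_loosened_condition` are hypothetical, the minimum and maximum are found by brute-force enumeration of the domain, and the bound `0 <= j < 6` is assumed purely for illustration because the range of `j` is not given in this excerpt. The sketch also counts the locations in the inclusive interval [m_1, m_2] directly, which gives m_2 - m_1 + 1 locations.

```python
from itertools import product


def lmad_addresses(base, strides, extents):
    """Yield base + sum(p_k * v_k) for every index vector V in the domain."""
    for v in product(*(range(u) for u in extents)):
        yield base + sum(p * x for p, x in zip(strides, v))


def bounding_interval(lmads, extents):
    """Conservative superset in the spirit of Algorithm 3 (brute-force min/max)."""
    m1 = min(min(lmad_addresses(b, s, extents)) for b, s in lmads)
    m2 = max(max(lmad_addresses(b, s, extents)) for b, s in lmads)
    M = range(m1, m2 + 1)  # every integer location in [m1, m2], inclusive

    def F(m):
        return m - m1  # maps a CPU location to its offset in the GPU buffer

    return M, F


def satisfies_loosened_condition(strides, extents):
    """Check p_k > sum_{j>k} p_j * (u_j - 1) for every dimension k."""
    d = len(strides)
    return all(
        strides[k] > sum(strides[j] * (extents[j] - 1) for j in range(k + 1, d))
        for k in range(d)
    )


# ASSUMPTION: 0 <= i < 5 is from the text; 0 <= j < 6 is hypothetical.
extents = (5, 6)
lmads = [(0, (20, 3)), (21, (20, 3))]  # (base, strides) for L_1 and L_2

M, F = bounding_interval(lmads, extents)
touched = {a for b, s in lmads for a in lmad_addresses(b, s, extents)}
print(len(M), len(touched))
```

Under the assumed bound on `j`, the interval spans 117 locations while the two access functions touch only 60 distinct ones, which illustrates why a tighter solution is worth deriving when all strides are shared.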
