11.07.2015 Views

A Compiler for Parallel Execution of Numerical Python Programs on ...



identify loads that can be coalesced.

Reducing the number of load instructions is only done by jit4GPU when it can find suitable loads in the loop body to coalesce. Thus, larger loop bodies have more potential for load-coalescing. Before doing load-coalescing transformations, jit4GPU carries out a loop-unrolling pass. A larger loop body also helps the AMD IL compiler to do better instruction scheduling and register allocation. However, larger loop bodies can increase register usage per thread, which lowers the number of threads that can run in parallel on the GPU. Therefore, the compiler first uses a heuristic to determine whether the register usage is below a certain threshold. Loop transformations are only done if register usage is below the threshold. Estimating register usage can be time consuming but fortunately need not be done by the JIT compiler. Register usage is estimated by unPython, which passes the information to jit4GPU. If enough registers are available, then a loop unroll factor of either four or two is applied.
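The gating logic described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `choose_unroll_factor`, its parameters, and any threshold value passed to it are hypothetical and are not taken from jit4GPU or unPython. Preferring four over two based on divisibility follows the rule stated in the text right after this sketch, and falling back to no unrolling when neither factor divides the bound is an assumption.

```python
def choose_unroll_factor(estimated_registers, register_threshold, upper_bound):
    """Pick an unroll factor of 4, 2, or 1 (meaning: do not unroll).

    Hypothetical helper: names, parameters, and the fallback to 1 are
    assumptions made for illustration, not the jit4GPU implementation.
    """
    # Transformations are applied only when the estimated register
    # usage is below the threshold.
    if estimated_registers >= register_threshold:
        return 1
    # Try four first, then two, requiring the bound to be divisible.
    for factor in (4, 2):
        if upper_bound % factor == 0:
            return factor
    # Assumption: if neither factor divides the bound, leave the loop alone.
    return 1
```

For example, `choose_unroll_factor(10, 32, 1024)` selects a factor of four, while a bound of 6 under the same register estimate selects two.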
The factor four was chosen because in many cases an unroll by four gives good opportunities to coalesce loads into float4, the widest data type available on GPUs. If the loop upper bound is not divisible by four, then an unroll by two is attempted instead. Inner loops are given higher priority for unrolling.

Under special conditions jit4GPU also performs loop fusion. Loop fusion is a loop transformation in which two loops with the same loop bounds and no dependence are fused into a single loop whose body is the concatenation of the two original loop bodies. Thus, loop fusion produces a single larger loop from two smaller loops. However, loop fusion can only be performed when there is no dependence between the loops. Further, if there is some code occurring between the two loops, then loop fusion can only be applied if the intervening code can be safely moved to a location before the first loop. I have not implemented dependence analysis in jit4GPU. Therefore, in general jit4GPU cannot perform loop fusion. However, under some circumstances the dependence checking is not required.
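As a minimal illustration of loop fusion in general (not of jit4GPU's generated code), the sketch below shows two independent loops over the same bounds and their fused equivalent. The function names and the arithmetic in the loop bodies are hypothetical, and both inputs are assumed to have the same length.

```python
def separate_loops(a, b):
    """Two loops with identical bounds and no dependence between them."""
    n = len(a)  # assumption: len(b) == len(a)
    scaled = [0.0] * n
    shifted = [0.0] * n
    for i in range(n):  # first loop: reads a, writes scaled
        scaled[i] = 2.0 * a[i]
    for i in range(n):  # second loop: reads b, writes shifted
        shifted[i] = b[i] + 1.0
    return scaled, shifted


def fused_loop(a, b):
    """The same computation after fusion: one loop, concatenated bodies."""
    n = len(a)  # assumption: len(b) == len(a)
    scaled = [0.0] * n
    shifted = [0.0] * n
    for i in range(n):
        scaled[i] = 2.0 * a[i]    # body of the first loop
        shifted[i] = b[i] + 1.0   # body of the second loop
    return scaled, shifted
```

Fusion is safe here because the second loop neither reads nor writes anything the first loop writes, so both versions return the same lists.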
Let loop L1 be a parallel loop and let L2 be a loop contained in the body of L1. If L1 is unrolled, then a copy L2′ of L2 is formed. There cannot be any dependence between different iterations of L1 because L1 is parallel. Therefore, there cannot be any dependence between L2 and L2′, because L2 and L2′ are in the bodies of two different iterations of L1. Therefore, the compiler can easily fuse the loops L2 and L2′.

Jit4GPU performs all of the loop transformations described above before generating AMD IL code. Jit4GPU performs loop unrolling and limited loop fusion, followed by load-coalescing, as a means to reduce the number of texture unit instructions executed. An overview of the transformations performed by jit4GPU is given in Algorithm 6.
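The argument about L1, L2 and L2′ can be made concrete with a small Python sketch. It is an illustration only: the function names and the element-wise squaring are hypothetical, the outer loop is unrolled by two rather than four to keep the example short, and the row count is assumed to be even.

```python
def square_original(a):
    """L1 iterates over rows (parallel); L2 iterates over columns."""
    rows, cols = len(a), len(a[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):          # L1
        for j in range(cols):      # L2
            out[i][j] = a[i][j] * a[i][j]
    return out


def square_unrolled(a):
    """L1 unrolled by two: the body now contains L2 and its copy L2'."""
    rows, cols = len(a), len(a[0])
    if rows % 2 != 0:
        raise ValueError("sketch assumes an even number of rows")
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(0, rows, 2):    # unrolled L1
        for j in range(cols):      # L2, from iteration i
            out[i][j] = a[i][j] * a[i][j]
        for j in range(cols):      # L2', from iteration i + 1
            out[i + 1][j] = a[i + 1][j] * a[i + 1][j]
    return out


def square_unrolled_fused(a):
    """L2 and L2' fused; no dependence check is needed because they
    come from different iterations of the parallel loop L1."""
    rows, cols = len(a), len(a[0])
    if rows % 2 != 0:
        raise ValueError("sketch assumes an even number of rows")
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(0, rows, 2):    # unrolled L1
        for j in range(cols):      # fused L2 and L2'
            out[i][j] = a[i][j] * a[i][j]
            out[i + 1][j] = a[i + 1][j] * a[i + 1][j]
    return out
```

All three functions compute the same result. The fused version has the larger inner-loop body, which is what gives a later load-coalescing step more loads to work with.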
