A Compiler for Parallel Execution of Numerical Python Programs on ...

identify loads that can be coalesced. Jit4GPU reduces the number of load instructions only when it can find suitable loads in the loop body to coalesce. Thus, larger loop bodies have more potential for load-coalescing. Before doing load-coalescing transformations, jit4GPU carries out a loop unrolling pass. A larger loop body also helps the AMD IL compiler to do better instruction scheduling and register allocation. However, larger loop bodies can increase register usage per thread, which lowers the number of threads running in parallel on the GPU. Therefore, the compiler first uses a heuristic to determine whether register usage is below a certain threshold. Loop transformations are only done if register usage is below the threshold. Estimating register usage can be time consuming but fortunately need not be done by the JIT compiler: register usage is estimated by unPython, which passes the information to jit4GPU. If enough registers are available, then a loop unroll factor of either four or two is applied. The factor four was chosen because in many cases an unroll by four gives good opportunities to coalesce loads into float4, the widest data type available on GPUs. If the loop upper bound is not divisible by four, then an unroll by two is attempted instead. Inner loops are given higher priority for unrolling.

Under special conditions jit4GPU also performs loop fusion. Loop fusion is a loop transformation in which two loops with the same loop bounds and no dependence between them are fused into a single loop whose body is the concatenation of the two original loop bodies. Thus, loop fusion produces a single larger loop from two smaller loops.
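The unroll-factor rule and the way unrolling exposes loads for coalescing can be sketched in plain Python. This is only an illustration, not jit4GPU code: the function names, the register numbers, and the choice to skip unrolling when the bound is divisible by neither four nor two are assumptions made for the example. The four-element slice stands in for a single float4 load.

```python
def choose_unroll_factor(upper_bound, est_registers, register_threshold):
    """Return 4, 2, or 1 (no unrolling), following the rule in the text.

    Assumption: when the bound is divisible by neither 4 nor 2,
    the loop is simply left as it is.
    """
    if est_registers >= register_threshold:
        return 1  # register usage is not below the threshold
    if upper_bound % 4 == 0:
        return 4
    if upper_bound % 2 == 0:
        return 2
    return 1


def scaled_sum(a, scale):
    """Original loop: one scalar load per iteration."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * scale
    return total


def scaled_sum_unrolled4(a, scale):
    """Loop unrolled by four: the four adjacent loads are grouped into
    one slice, mimicking a single float4 load."""
    if len(a) % 4 != 0:
        raise ValueError("length must be divisible by four")
    total = 0.0
    for i in range(0, len(a), 4):
        x0, x1, x2, x3 = a[i:i + 4]  # stands in for one float4 load
        total += x0 * scale
        total += x1 * scale
        total += x2 * scale
        total += x3 * scale
    return total
```

The unrolled version issues one grouped load for every four elements while performing the additions in the same order as the original loop, so the two functions return the same value.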
However, loop fusion can only be performed when there is no dependence between the loops. Further, if there is some code occurring between the two loops, then loop fusion can only be applied if the intervening code can be safely moved to a location before the first loop. I have not implemented dependence analysis in jit4GPU. Therefore, in general jit4GPU cannot perform loop fusion. However, under some circumstances dependence checking is not required. Let loop L1 be a parallel loop and let L2 be a loop contained in the body of L1. If L1 is unrolled, then a copy L2′ of L2 is formed. There cannot be any dependence between different iterations of L1 because L1 is parallel. Therefore, there cannot be any dependence between L2 and L2′, because L2 and L2′ belong to the bodies of two different iterations of L1. Therefore, the compiler can easily fuse the loops L2 and L2′.

Jit4GPU performs all the loop transformations described above before generating AMD IL code. Jit4GPU performs loop unrolling and limited loop fusion, followed by load-coalescing, as a means to reduce the number of texture unit instructions executed. The transformations performed by jit4GPU are summarized in Algorithm 6.
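The argument about L1, L2 and L2′ can be made concrete with a small Python sketch. The row-scaling computation, the function names, and the assumption that the number of rows is even are all chosen for illustration and do not come from the thesis. The outer loop over `i` plays the role of the parallel loop L1 and the inner loop over `j` plays the role of L2.

```python
def scale_rows(a, s):
    """Original nest: L1 (parallel over i) contains L2 (over j)."""
    n, m = len(a), len(a[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):              # L1
        for j in range(m):          # L2
            out[i][j] = a[i][j] * s[i]
    return out


def scale_rows_unrolled(a, s):
    """L1 unrolled by two: the body now holds L2 and its copy L2'."""
    n, m = len(a), len(a[0])        # assumes n is even
    out = [[0.0] * m for _ in range(n)]
    for i in range(0, n, 2):
        for j in range(m):          # L2, from iteration i
            out[i][j] = a[i][j] * s[i]
        for j in range(m):          # L2', from iteration i + 1
            out[i + 1][j] = a[i + 1][j] * s[i + 1]
    return out


def scale_rows_unrolled_fused(a, s):
    """L2 and L2' fused. This is safe without dependence analysis because
    they come from different iterations of the parallel loop L1."""
    n, m = len(a), len(a[0])        # assumes n is even
    out = [[0.0] * m for _ in range(n)]
    for i in range(0, n, 2):
        for j in range(m):          # fused body of L2 and L2'
            out[i][j] = a[i][j] * s[i]
            out[i + 1][j] = a[i + 1][j] * s[i + 1]
    return out
```

All three versions compute the same result. The fused version has the larger inner loop body that later load-coalescing can work on.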
Algorithm 6 performs loop transformations for GPU code
  for all loops with no child loops do
    Mark as candidate for unrolling.
  end for
  for all parallel loops do
    Mark as suitable for unrolling if child loops have no children.
  end for
  while register usage is below threshold do
    Pick the innermost unrolling candidate loop available and unroll.
    Update register usage estimate of parent loops.
    Mark the unrolled loop as unsuitable for unrolling.
  end while
  Fuse as many loops as possible.
  Perform load-coalescing transformations.
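A minimal Python sketch of the candidate-marking and greedy unrolling steps of Algorithm 6 is given below. Several details are assumptions made for illustration and are not taken from jit4GPU: the `Loop` class, the fixed unroll factor, the cost model (unrolling by a factor f adds f - 1 times the loop's current register estimate to the loop and to every ancestor), the use of the root loop's estimate as the kernel's register usage, and stopping when no candidate remains. The fusion and load-coalescing steps are not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Loop:
    """Hypothetical loop-tree node; not a jit4GPU data structure."""
    name: str
    parallel: bool = False
    registers: int = 1                  # assumed register estimate
    children: List["Loop"] = field(default_factory=list)
    parent: Optional["Loop"] = None
    candidate: bool = False
    factor: int = 1

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def walk(loop):
    """Yield the loop and all loops nested inside it."""
    yield loop
    for child in loop.children:
        yield from walk(child)


def depth(loop):
    """Nesting depth, counted from the root loop."""
    d = 0
    while loop.parent is not None:
        loop = loop.parent
        d += 1
    return d


def plan_unrolling(root, threshold, factor=2):
    """Return the names of the loops chosen for unrolling, in order."""
    loops = list(walk(root))
    for loop in loops:                  # loops with no child loops
        if not loop.children:
            loop.candidate = True
    for loop in loops:                  # parallel loops over leaf loops
        if loop.parallel and all(not c.children for c in loop.children):
            loop.candidate = True
    order = []
    while root.registers < threshold:
        available = [l for l in loops if l.candidate]
        if not available:               # assumed stopping condition
            break
        pick = max(available, key=depth)        # innermost first
        extra = pick.registers * (factor - 1)   # assumed cost model
        pick.registers += extra
        pick.factor = factor
        ancestor = pick.parent
        while ancestor is not None:     # update parent estimates
            ancestor.registers += extra
            ancestor = ancestor.parent
        pick.candidate = False          # never unroll a loop twice
        order.append(pick.name)
    return order
```

For a parallel loop `i` with an estimate of 8 registers that contains an inner loop `j` with an estimate of 4, a threshold of 16 unrolls `j` first and then `i`, while a threshold of 10 unrolls only `j`. Because the threshold is tested before each unrolling step, as in the pseudocode, the final estimate can exceed the threshold.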
