A Compiler for Parallel Execution of Numerical Python Programs on ...

the smallest set $E$ because any value of $b_0$ smaller than $b_1$ will result in either the same or a larger domain $D'$. The methodology is generalized by construction in Algorithm 4 for $n$ RCSLMADs with $b_0 = b_1$. The idea is that if the number of non-distinct bases in the constructed ERL $E$ is $q$, and if the upper bound of the domain of $E$ is $t_n + u$, then the total number of elements to be transferred to the GPU is $(t_n + u) * q$, since $t_n + u$ elements need to be transferred for each non-distinct RCSLMAD present in the ERL $E$. Thus, on the GPU, we can allocate space equal to $(t_n + u) * q$ elements.

For address mapping, consider the interval $b_0$ to $b_0 + m - 1$ in system RAM. Out of this interval, the $q$ elements accessed by the $q$ non-overlapping RCSLMADs are transferred. This pattern of $q$ out of every $m$ elements is repeated. Therefore, if the elements not accessed by the ERL were discarded, the pattern would be equivalent to $q$ interleaved accesses with stride $q$.

Algorithm 4 computes the approximate union of $n$ one-dimensional RCSLMADs with the same stride.

Inputs: $n$ one-dimensional RCSLMADs $\{L_1, L_2, .., L_n\}$ with stride $m$ and bases $\{b_1, b_2, .., b_n\}$ such that $b_1 \le b_2 ... \le b_n$, defined over domain $D = \{i, 0 \le i < u\}$.
Outputs: ERL $E$ and a list of $n$ expressions representing the transformed address of each RCSLMAD.

1: for each RCSLMAD $L_j$ do
2:     Compute the terms $t_j = \lfloor (b_j - b_1)/m \rfloor$ and $r_j = (b_j - b_1) \% m$.
3: end for
4: Construct a list $R = \{b_1 + r_1, b_1 + r_2, .., b_n + r_n\}$.
5: Remove any duplicates from $R$. Let the number of elements remaining in $R$ be $q$.
6: Sort $R$ in-place.
7: Construct a one-dimensional ERL $E$ with stride $m$, bases $R$ and domain $D' = \{j, 0 \le j \le t_n + u\}$.
8: Construct an empty list $I$.
9: for each RCSLMAD $L_j$ do
10:     Find $x$ such that $R(x) = b_j + r_j$.
11:     Append the expression (represented as an AST or other compiler IR) $q * (i + t_j) + x$ to $I$, where $i$ is the original loop counter for which the analysis is being conducted.
12: end for
13: return $E$ and $I$.

5.3.3 Multidimensional RCSLMADs with common strides

Consider $n$ RCSLMADs $\{L_1, L_2, .., L_n\}$ with the common strides $P = \{p_1, p_2, .., p_d\}$. Let the RCSLMADs be defined over a $d$ dimensional domain $D = \{(i_1, i_2, .., i_d) \mid 0 \le i_j$
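Returning to Algorithm 4 for the one-dimensional case, its steps can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the names `ERL` and `approximate_union` are hypothetical, address expressions are modelled as plain Python callables rather than compiler IR, and `x` is taken to be a zero-based index into `R`. The listing writes the entries of `R` both as $b_1 + r_j$ and as $b_j + r_j$; the sketch assumes every entry is $b_1 + r_j$, the first base plus the remainder.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ERL:
    """Hypothetical container for a one-dimensional ERL."""
    stride: int
    bases: List[int]   # the sorted, duplicate-free list R
    upper_bound: int   # t_n + u, the bound of the domain D'


def approximate_union(bases: List[int], stride: int, u: int
                      ) -> Tuple[ERL, List[Callable[[int], int]]]:
    """Sketch of Algorithm 4 for RCSLMADs that share one stride."""
    if stride <= 0 or not bases or bases != sorted(bases):
        raise ValueError("need a positive stride and sorted, non-empty bases")
    b1 = bases[0]

    # Steps 1-3: quotient t_j and remainder r_j of each base relative to b_1.
    t = [(b - b1) // stride for b in bases]
    r = [(b - b1) % stride for b in bases]

    # Steps 4-6: build R, drop duplicates, sort.
    # Assumption: every entry of R is b_1 + r_j.
    R = sorted({b1 + rj for rj in r})
    q = len(R)

    # Step 7: the ERL keeps stride m, bases R and the bound t_n + u.
    erl = ERL(stride=stride, bases=R, upper_bound=t[-1] + u)

    # Steps 8-12: one transformed address expression per RCSLMAD.
    exprs: List[Callable[[int], int]] = []
    for tj, rj in zip(t, r):
        x = R.index(b1 + rj)  # zero-based position in R (assumption)
        exprs.append(lambda i, tj=tj, x=x: q * (i + tj) + x)
    return erl, exprs


erl, exprs = approximate_union([0, 1, 4], stride=3, u=4)
print(erl.bases, erl.upper_bound)   # [0, 1] 5
print([f(0) for f in exprs])        # [0, 1, 3]
```

In this example the bases 1 and 4 share the remainder 1, so `R` has `q = 2` entries and the GPU buffer needs `(t_n + u) * q = 10` elements. The third RCSLMAD reads the same interleaved slot as the second, shifted by `t_3 = 1` iteration.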
solve for unknown parameters of $E$ using an integer programming solver. The equations are derived as follows:

1. $E$ can be constrained such that the $m$-th component of $E$ must be a superset of RCSLMAD $L_m$. The idea is to assume that for each $V \in D$, there exists a point $V' \in D'$ such that $V' = V + \{t_{m1}, t_{m2}, ..., t_{md}\}$ and that $E(m)(V') = L_m(V)$. The values $t_{mk}$ are assumed to be unknown integer constants. The following equations can be stated:

$$ b_m + \sum_{k=1}^{d} p_k * i_k = b'_m + \sum_{k=1}^{d} p_k * (i_k + t_{mk}) \tag{5.51} $$
$$ u'_1 \ge u_1 + t_{m1} \tag{5.52} $$
$$ u'_2 \ge u_2 + t_{m2} \tag{5.53} $$
$$ \vdots $$
$$ u'_d \ge u_d + t_{md} \tag{5.54} $$
$$ t_{m1} \ge 0 \tag{5.55} $$
$$ t_{m2} \ge 0 \tag{5.56} $$
$$ \vdots $$
$$ t_{md} \ge 0 \tag{5.57} $$

If suitable integer values are found for the unknowns $t_{mk}$ and $u'_k$ that satisfy the above constraints, then $E$ is a superset of the union of the RCSLMADs by construction.

2. Each component of $E$ must be an RCSLMAD and must therefore satisfy constraints relating the upper bounds and strides. For each integer $k$ such that $2 \le k \le d$:

$$ p'_k \ge p_d + \sum_{j=k+1}^{d} \left( p_j * (u'_j - 1) \right) \tag{5.58} $$

3. From the definition of an ERL, the difference between any pair of bases $(b'_x, b'_y)$ should be less than the stride $p_d$ in the last dimension. For $\{(x, y) \mid 1 \le x \le n, 1 \le y \le n, x \ne y\}$:

$$ b'_x - b'_y \le p_d - 1 \tag{5.59} $$

These inequalities form a set of $(n - 1)^2$ constraints necessary for ensuring that $E$ is an ERL.

Thus a total of $n$ equality constraints and $n * d + (d - 1) + (n - 1)^2$ inequality constraints can be derived for a total of $n + d + n * d$ unknowns. The unknown variables are $u'_k$, $b'_m$
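To make the first group of constraints concrete: the $\sum p_k * i_k$ terms cancel on both sides of (5.51), leaving $b_m - b'_m = \sum_{k=1}^{d} p_k * t_{mk}$. The sketch below enumerates the offsets $t_{mk}$ that satisfy this equality together with (5.52) to (5.57) for a single component $m$. It is only an illustration: brute-force enumeration stands in for the integer programming solver the text refers to, the candidate bounds $u'_k$ and base $b'_m$ are treated as given rather than unknown, and the function name `feasible_offsets` is hypothetical.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def feasible_offsets(b_m: int, b_prime_m: int,
                     strides: Sequence[int],
                     u: Sequence[int],
                     u_prime: Sequence[int]) -> Iterator[Tuple[int, ...]]:
    """Yield offset vectors (t_m1, .., t_md) satisfying (5.51)-(5.57).

    Assumes b'_m and the candidate bounds u'_k are already fixed.
    """
    if not (len(strides) == len(u) == len(u_prime)):
        raise ValueError("strides, u and u_prime must have the same length")

    # (5.52)-(5.57): 0 <= t_mk <= u'_k - u_k in every dimension k.
    ranges = [range(max(0, up - uk + 1)) for uk, up in zip(u, u_prime)]

    for t in product(*ranges):
        # (5.51) after cancelling the i_k terms on both sides.
        if b_m - b_prime_m == sum(p * tk for p, tk in zip(strides, t)):
            yield t


# Two dimensions with strides (10, 1); L_m starts 12 elements past b'_m.
print(list(feasible_offsets(12, 0, (10, 1), (4, 4), (5, 6))))  # [(1, 2)]
```

With the tighter bounds `u_prime = (4, 4)` no offset exists, which corresponds to the case where the candidate domain $D'$ is too small to cover $L_m$.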