A Compiler for Parallel Execution of Numerical Python Programs on ...


The algorithm is very conservative: it transfers every memory location in the memory interval spanned by the group. The algorithm is correct, but it transfers too much data and uses too much GPU on-board memory. Better solutions are needed in most cases.

Algorithm 3 computes a superset of the union of arbitrary RCSLMADs
Inputs: A group $\{L_1, L_2, \ldots, L_n\}$ of RCSLMADs defined over domain $D$.
Outputs: Set $M$, size $S$ of set $M$, and function $F$ for mapping CPU memory locations to GPU memory locations.
1: Compute $m_1 = \min(\min_{V \in D} L_1(V), \min_{V \in D} L_2(V), \ldots, \min_{V \in D} L_n(V))$.
2: Compute $m_2 = \max(\max_{V \in D} L_1(V), \max_{V \in D} L_2(V), \ldots, \max_{V \in D} L_n(V))$.
3: Compute $M = \{m \mid m_1 \le m \le m_2,\ m \in \mathbb{Z}\}$.
4: Compute $S = m_2 - m_1$.
5: Construct $F(m) = m - m_1$ and $R = \{m \mid 0 \le m \le m_2 - m_1,\ m \in \mathbb{Z}\}$.
6: return $M$, $S$, and $F$.

While algorithms 1, 2 and 3 have only been described for RCSLMADs, they are actually applicable to a slightly wider class of LMADs. For convenience, assume that the ordering function is the identity function and consider the condition $p_k > p_{k+1} - 1 + \sum_{j=k+1}^{d} p_j \cdot (u_j - 1)$ imposed on RCSLMADs. If the condition is instead loosened to $p_k > \sum_{j=k+1}^{d} p_j \cdot (u_j - 1)$, then algorithms 1, 2 and 3 are still valid (this can be verified by substituting the loosened condition in the corresponding proofs).

5.3 More efficient solutions in specific cases

In some cases, it is possible to derive more efficient solutions that transfer less data than the general algorithm presented earlier. Computing and reasoning with unions of arbitrary RCSLMADs is non-trivial. Consider a group of $n$ RCSLMADs. Let $p_{jk}$ be the stride in the $j$-th RCSLMAD in the $k$-th dimension.
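To make the cost of the conservative approach concrete, the steps of Algorithm 3 can be sketched in Python as below. This is only an illustration under stated assumptions: the helper names `rcslmad` and `interval_superset` are hypothetical, the domain is enumerated by brute force rather than with the closed-form bounds a compiler would use, and the group of two access functions is chosen for this sketch. The inclusive interval $[m_1, m_2]$ holds $m_2 - m_1 + 1$ locations, one more than $S$ as stated in step 4, so the sketch reports both values.

```python
from itertools import product

def rcslmad(base, strides):
    """Hypothetical helper: access function V -> base + sum(stride_k * V_k)."""
    def access(v):
        return base + sum(p * x for p, x in zip(strides, v))
    return access

def interval_superset(lmads, upper_bounds):
    """Sketch of Algorithm 3 for access functions over a shared domain.

    upper_bounds are exclusive, so the domain D is
    0 <= V_k < upper_bounds[k] in every dimension k.
    """
    domain = list(product(*(range(u) for u in upper_bounds)))
    m1 = min(min(L(v) for v in domain) for L in lmads)  # step 1
    m2 = max(max(L(v) for v in domain) for L in lmads)  # step 2
    M = range(m1, m2 + 1)                               # step 3 (m2 included)
    S = m2 - m1                                         # step 4, as stated
    F = lambda m: m - m1                                # step 5: CPU -> GPU offset
    return M, S, F

# Illustrative group: two accesses with common strides (20, 3) and bases 0 and 21.
bounds = (5, 5)
group = [rcslmad(0, (20, 3)), rcslmad(21, (20, 3))]
M, S, F = interval_superset(group, bounds)

# Locations the loop nest actually touches, for comparison.
touched = {L(v) for L in group for v in product(*(range(u) for u in bounds))}
print(M[0], M[-1], S, len(M), len(touched))  # 0 113 113 114 50
```

In this illustrative case the interval covers 114 locations although only 50 are accessed, which is the kind of overhead the rest of this section tries to reduce.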
Instead of attempting to compute the union of RCSLMADs in arbitrary cases, this section is limited to the case $p_{t_1 k} = p_{t_2 k}$, $1 \le t_1, t_2 \le n$, i.e. all RCSLMADs have the same stride in any given dimension $k$. Such cases occur when a programmer accesses multiple array locations in the loop body with a fixed distance between the array accesses. All the RCSLMADs must have the same ordering function because all the RCSLMADs share strides and are defined over the same domain. Without loss of generality, assume that the ordering function is the identity throughout this section. One example of the types of problems being studied in this section is as follows:

$L_1 = 0 + 20 \cdot i + 3 \cdot j$ (5.23)
$L_2 = 21 + 20 \cdot i + 3 \cdot j$ (5.24)
$0 \le i < 5$ (5.25)
$0 \le j < 5$ (5.26)

In the example, the two RCSLMADs share the same strides but have different bases.

5.3.1 Representation of union

A suitable representation needs to be chosen to represent the union of RCSLMADs. The representation chosen influences how accurately the union can be computed. As in any other compiler analysis, a tradeoff needs to be made between simplicity and accuracy. One potential choice for representing unions is a single RCSLMAD. However, the set of RCSLMADs is not closed under the union operation, i.e. the union of two RCSLMADs cannot always be represented exactly using a single RCSLMAD. Instead, I define a new type of set designed to represent a collection of interleaved RCSLMADs with common strides. For brevity, I term such groups ERLs, for Extended Restricted CSLMADs, since they extend RCSLMADs to multiple bases. Formally, an ERL $E$ is defined as a set over a $d$-dimensional domain $D$ as follows:

Definition 9. Let there be $n$ RCSLMADs $L_1, L_2, \ldots, L_n$ defined over the same $d$-dimensional domain $D$ with the following constraints:
1. Let $p_{t_1 k}$ and $p_{t_2 k}$ represent the $k$-th strides of $L_{t_1}$ and $L_{t_2}$ respectively. Then the following condition must be satisfied:
$\forall (1 \le t_1, t_2 \le n,\ 1 \le k \le d)\quad p_{t_1 k} = p_{t_2 k}$ (5.27)
2. Let $b_x$ be the base of the RCSLMAD $L_x$. Then $\forall (1 \le x, y \le n)\ b_x - b_y$ [...] constraints, ERL $E$ is defined as the union of the RCSLMADs $L_1$ to $L_n$.
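As a rough illustration of Definition 9, an ERL can be held as one set of upper bounds, one set of strides, and a list of bases. The sketch below is an assumption-laden example, not the thesis implementation: the class names `RCSLMAD` and `ERL` and the function `make_erl` are hypothetical, only the common-stride constraint (5.27) and the shared domain are checked, the second constraint on base differences is not enforced, and the identity ordering function is assumed. The values are illustrative.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RCSLMAD:
    base: int
    strides: tuple       # one stride per dimension
    upper_bounds: tuple  # exclusive upper bound per dimension

@dataclass(frozen=True)
class ERL:
    upper_bounds: tuple  # d upper bounds
    strides: tuple       # d strides shared by all members
    bases: tuple         # n bases, one per member RCSLMAD

    def locations(self):
        """Enumerate the union of the member RCSLMADs by brute force."""
        domain = product(*(range(u) for u in self.upper_bounds))
        return {b + sum(p * x for p, x in zip(self.strides, v))
                for v in domain for b in self.bases}

def make_erl(lmads):
    """Group RCSLMADs into an ERL, checking only the shared domain and (5.27)."""
    first = lmads[0]
    for L in lmads[1:]:
        if L.upper_bounds != first.upper_bounds:
            raise ValueError("RCSLMADs must be defined over the same domain")
        if L.strides != first.strides:
            raise ValueError("RCSLMADs must share the stride in every dimension")
    return ERL(first.upper_bounds, first.strides, tuple(L.base for L in lmads))

# Illustrative values: two members with strides (20, 3), bounds (5, 5), bases 0 and 1.
erl = make_erl([RCSLMAD(0, (20, 3), (5, 5)), RCSLMAD(1, (20, 3), (5, 5))])
n_params = len(erl.upper_bounds) + len(erl.strides) + len(erl.bases)
print(n_params, len(erl.locations()))  # 6 50
```

Storing the strides and bounds once is what keeps the representation compact: adding another member to the group costs only one extra base.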
For $d$ dimensions and $n$ RCSLMADs, an ERL has $2 \cdot d + n$ parameters: $d$ upper bounds, $d$ strides and $n$ bases. One example of the type of sets represented by ERLs is the following:

$L_1 = 0 + 20 \cdot i + 3 \cdot j$ (5.29)
$L_2 = 1 + 20 \cdot i + 3 \cdot j$ (5.30)
$0 \le i < 5$ (5.31)
$0 \le j < 5$ (5.32)

As can be verified, the above RCSLMADs can be represented by an ERL since all conditions are satisfied. The two RCSLMADs are interleaved, i.e. they never overlap and have the same
