Tiling Stencil Computations to Maximize Parallelism

Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula
Department of Computer Science and Automation
Indian Institute of Science, Bangalore 560012
{vbandishti,irshad,uday}@csa.iisc.ernet.in

Abstract—Most stencil computations allow tile-wise concurrent start, i.e., there always exists a face of the iteration space and a set of tiling hyperplanes such that all tiles along that face can be started concurrently. This provides load balance and maximizes parallelism. However, existing automatic tiling frameworks often choose hyperplanes that lead to pipelined start-up and load imbalance. We address this issue with a new tiling technique that ensures concurrent start-up as well as perfect load balance whenever possible. We first provide necessary and sufficient conditions on tiling hyperplanes to enable concurrent start for programs with affine data accesses. We then provide an approach to find such hyperplanes. Experimental evaluation on a 12-core Intel Westmere shows that our code is able to outperform a tuned domain-specific stencil code generator by 4% to 27%, and previous compiler techniques by a factor of 2× to 10.14×.

Index Terms—Compilers, Program transformation

I. INTRODUCTION AND MOTIVATION

Stencils are a very common class of programs appearing in many scientific and engineering applications that are computationally intensive. They are characterized by a regular computational structure and hence allow automatic compile-time analysis and transformation for exploiting data locality and parallelism.

Stencil computations are characterized by the update of a grid point using neighboring points.
Figure 1 shows a stencil over a one-dimensional data space used to model, for example, a 1-d heat equation. Stencils exhibit a number of properties that lend themselves to optimization; locality optimization and parallelization are the most important among them. Loop tiling [1], [36], [38] is a key transformation used to exploit data locality and parallelism from stencil computations.

Fig. 1. 1-d heat stencil (code listing truncated).
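The listing in Figure 1 is cut off above. As a hedged illustration only (the exact loop bounds and coefficients of Figure 1 are assumptions, chosen by analogy with the 2-d heat listing in Figure 11), a 1-d heat-style 3-point update looks like:

```python
def step(u, c=0.125):
    """One time step of a 3-point 1-d heat stencil:
    u_new[i] = u[i] + c * (u[i+1] - 2*u[i] + u[i-1])."""
    n = len(u)
    new = u[:]  # boundary points kept fixed (non-periodic boundary)
    for i in range(1, n - 1):
        new[i] = u[i] + c * (u[i + 1] - 2.0 * u[i] + u[i - 1])
    return new

def run(u, timesteps):
    """Apply the stencil for the given number of time steps."""
    for _ in range(timesteps):
        u = step(u)
    return u
```

Each point's next value depends only on its immediate neighbors at the previous time step, which is what induces the dependences discussed in the rest of the paper.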


Section II provides background and introduces the notation used. Section III characterizes concurrent start conditions. Section IV describes our approach to find solutions with the desired properties. Section V presents the experimental evaluation, Section VI discusses related work, and conclusions are presented in Section VII.

II. BACKGROUND AND NOTATION

A. Outer parallelism and concurrent start

Achieving concurrent start in iteration spaces that have at least one degree of outer parallelism, or communication-free parallelism, is fairly simple. Finding concurrent start becomes complicated when dependences span the entire iteration space. In this paper, since we consider stencils, we are looking at programs which have no outer parallelism. However, there still exists the opportunity to start concurrently along at least one face of the iteration space.

B. Characterizing a stencil program

We assume and exploit the following inherent properties of stencil computations.
• As a grid point is always updated using values from neighboring grid points in recent time steps, the loop nests always have a 'time-step' loop as the outermost loop.
• The entire space-time grid of d+1 dimensions uses an array of d dimensions, with the outermost index representing different time steps.
• They can always be written in single-statement form as:

  value_p[t+1] = f(value_p[t], value_{neighbors of p}[t], ...)

C. Conic and strict conic combination

A conic combination of a set of vectors x_1, x_2, ..., x_n is a vector of the form

  λ_1 x_1 + λ_2 x_2 + ... + λ_n x_n,  λ_i ≥ 0   (1)

If all λs are strictly positive, we call (1) a strict conic combination of x_1, x_2, ..., x_n.

D. Dependences and tiling hyperplanes

Let S_1, S_2, ..., S_n be the statements of a program, and let S = {S_1, S_2, ..., S_n}. The iteration space of a statement can be represented as a polyhedron. The dimensions of the polyhedron correspond to surrounding loop iterators as well as program parameters.
Program parameters are symbols that do not vary in the portion of the code we are representing; they are typically the problem sizes. Each integer point in the polyhedron, also called an iteration vector, contains values of the induction variables of the loops surrounding the statement, from outermost to innermost. The data dependence graph DDG = (S, E) is a directed multi-graph with each vertex representing a statement in the program and an edge e from S_i to S_j representing a polyhedral dependence from a dynamic instance of S_i to one of S_j. Every edge e is characterized by a polyhedron P_e, called the dependence polyhedron, which precisely captures all the dependences between the dynamic instances of S_i and S_j. One can obtain a less powerful representation such as a constant distance vector or a direction vector from dependence polyhedra by analyzing the relation between source and target iterators. For all examples in the paper, we use constant distance vectors for simplicity. The reader is referred to [4] for a more detailed explanation of the polyhedral representation.

A hyperplane is an (n−1)-dimensional affine subspace of an n-dimensional space. For example, any line is a hyperplane in a 2-d space, and any 2-d plane is a hyperplane in 3-d space. A hyperplane for a statement S_i is of the form:

  φ_{S_i}(x) = h·x + h_0   (2)

where h_0 is the translation or constant shift component, and x is an iteration of S_i. h itself can be viewed as a vector oriented in a direction normal to the hyperplane.

Prior research [24], [5] provides conditions for a hyperplane to be a valid tiling hyperplane. For φ_{S_1}, φ_{S_2}, ..., φ_{S_k} to be valid statement-wise tiling hyperplanes for S_1, S_2, ..., S_k respectively, the following should hold for each edge e from S_i to S_j:

  φ_{S_j}(t) − φ_{S_i}(s) ≥ 0,  ⟨s, t⟩ ∈ P_e, ∀e ∈ E   (3)

The above constraint implies that all dependences have non-negative components along each of the hyperplanes, i.e., their projections on the hyperplane normals are never in a direction opposite to that of the hyperplane normals.

In addition, the set of tiling hyperplanes should be linearly independent of each other. Each statement has as many linearly independent tiling hyperplanes as its loop nest dimensionality. Among the many possible hyperplanes, the optimal solution according to a cost function is chosen. A cost function that has worked well in the past is based on minimizing dependence distances lexicographically, with hyperplanes being found from outermost to innermost [5].

If all iteration spaces are bounded, there exists an affine function v(p) = u·p + w that bounds the dependence distance φ_{S_j}(t) − φ_{S_i}(s) for every dependence edge e:

  v(p) − (φ_{S_j}(t) − φ_{S_i}(s)) ≥ 0,  ⟨s, t⟩ ∈ P_e, ∀e ∈ E   (4)

where p is a vector of program parameters, i.e., symbols appearing in loop bounds and accesses; these often represent problem sizes. The coefficients u, w are then minimized. This ensures the following:
• In the transformed space, all dependences have non-negative components along all bases.
• Any rectangular tiling in the transformed space is valid.

Fig. 2. Example stencil program (code listing truncated).
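For uniform dependences represented by constant distance vectors, condition (3) reduces to requiring a non-negative dot product of the hyperplane normal with every distance vector. A small sketch, using the dependences of the example in Figure 3:

```python
def is_valid_hyperplane(h, deps):
    """Eq. (3) specialized to uniform dependences: h . d >= 0
    for every constant distance vector d."""
    return all(sum(hc * dc for hc, dc in zip(h, d)) >= 0 for d in deps)

# Dependences of the example stencil (Figure 3): d1, d2, d3.
deps = [(1, -2), (1, 0), (1, 2)]
```

Any hyperplane in the cone of (2, −1) and (2, 1) passes this check, while e.g. (0, 1) fails because (0, 1)·(1, −2) = −2.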


Fig. 3. d_1 = (1, 2), d_2 = (1, 0), and d_3 = (1, −2) are the dependences. For any hyperplane φ in the cone formed by (2, 1) and (2, −1), φ·d_1 ≥ 0 ∧ φ·d_2 ≥ 0 ∧ φ·d_3 ≥ 0 holds true. The cost function chooses (1, 0) and (2, 1) as tiling hyperplanes.

Example: Consider the program in Figure 2. As all the dependences are self-dependences and also uniform, they can be represented by constant vectors. The dependences are:

  (d_1 d_2 d_3) = [ 1  1  1 ]
                  [−2  0  2 ]

Equation 3 can now be written as

  φ · [ 1  1  1 ] ≥ (0 0 0)   (5)
      [−2  0  2 ]

Equation 5 forces the chosen hyperplane to have non-negative components along all dependences. Therefore, any hyperplane in the cone of (2, −1) and (2, 1) is a valid one (Figure 3). However, cost function (4) ends up choosing (1, 0) and (2, 1).

E. Rectangular tiling and inter-tile dependences

As mentioned earlier, once the transformation is applied, the transformed space can be tiled rectangularly. Hence, by assuming that inter-tile dependences in the transformed space are unit vectors along all the bases, we can be sure that any inter-tile dependence is satisfied (Figure 4). It is important to note here that this approximation is actually accurate for stencils, i.e., it is also necessary that we consider inter-tile dependences along every base, as the dependences span the entire iteration space (there is no outer parallelism). In the rest of the paper we refer to these safely approximated inter-tile dependences as simply inter-tile dependences. Thus, if F′ is the matrix whose columns are the approximated inter-tile dependences in the transformed space, F′ will be a unit matrix.

Fig. 4. Transformed iteration space (t_1 = t, t_2 = 2t + i) after applying (1, 0) and (2, 1) as the tiling hyperplanes. Approximating unit inter-tile dependences along every base accounts for any actual inter-tile dependence.

Let T_R be the reduced transformation matrix which has only the h components (Eqn. 2) of all hyperplanes as rows. T_R of any statement is thus a square matrix, which can be obtained by eliminating the columns producing translation and any other rows meant to specify loop distribution at any level [21], [10], [5]. As the inter-tile dependences are all intra-statement and are not affected by the translation components of tiling hyperplanes, we have:

  F′ = T_R · F

where F corresponds to the approximate inter-tile dependences in the original iteration space. As mentioned earlier, F′ can safely be considered a unit matrix. Therefore,

  F = T_R^{-1}

Example: The tiling dependences introduced by hyperplanes (1, 0) and (2, 1) can be found as follows:

  (c_1 c_2) = [ 1  0 ]^{-1} = [ 1  0 ]
              [ 2  1 ]        [−2  1 ]

Thus, the inter-tile dependences are (1, −2) and (0, 1). The tiling dependences introduced in the iteration space of a statement solely depend on the hyperplanes chosen for that statement. Also, note that every inter-tile dependence, given the way we defined it, is carried by only one of the tiling hyperplanes and has a zero component along any other tiling hyperplane.

Fig. 5. Inter-tile dependences for the example in Figure 2 introduced by the hyperplanes (1, 0) and (2, 1).

Consider Figure 5. The tiling hyperplanes in the figure are (1, 0) and (2, 1). Let us find the inter-tile dependence carried by (1, 0). There are only two unit vectors in the orthogonal subspace of the other hyperplane (2, 1). The only two dependences which have a zero component along (2, 1) are (1, −2) and (−1, 2). Among these two, the one which is satisfied by (1, 0) is (1, −2). So, the inter-tile dependence satisfied by (1, 0) is (1, −2). Similarly, the dependence satisfied by (2, 1) is (0, 1).

In the above example, concurrent start along (1, 0) is prevented by the tile dependence (0, 1).
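The relation F = T_R^{-1} can be checked mechanically. A sketch restricted to the 2×2 case of the running example, using exact rational arithmetic:

```python
from fractions import Fraction

def inverse_2x2(m):
    """Exact inverse of a 2x2 matrix; the rows of T_R are the
    tiling hyperplanes (translation components dropped)."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    return [[ d / det, -b / det],
            [-c / det,  a / det]]

def inter_tile_deps(hyperplanes):
    """Columns of T_R^{-1} approximate the inter-tile dependences
    in the original iteration space (F = T_R^{-1})."""
    inv = inverse_2x2(hyperplanes)
    return [(inv[0][j], inv[1][j]) for j in range(2)]
```

For T_R with rows (1, 0) and (2, 1) this yields the columns (1, −2) and (0, 1) derived in the text; for rows (2, −1) and (2, 1) the columns are positive multiples of (1, −2) and (1, 2).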
There does not exist a face along which the tiles can start in parallel (Figure 6), i.e., there would be a pipelined start-up and drain phase. Decreasing the tile size would decrease the start-up and drain phases but would increase the frequency of synchronization. Increasing the tile size would mean a shorter steady state. In summary, concurrent start-up is lost.

The transformation should be found in such a way that it does not introduce any inter-tile dependence that prohibits concurrent start. If we had chosen (2, −1), which is also valid, instead of (1, 0) as a tiling hyperplane, i.e., if (2, −1) and (2, 1) were chosen as the tiling hyperplanes, then the inter-tile dependences introduced would be (1, −2) and (1, 2) (Figure 7). Both have positive components along the normal (1, 0), i.e., all tiles along the normal (1, 0) could be started concurrently.

In the next section, we provide conditions on hyperplanes to avoid (1, 0) and select (2, −1) instead.

III. TILING FOR CONCURRENT START

If all the iterations along a face can be started concurrently, the face is said to allow point-wise concurrent start. Similarly, if all the tiles along a face can be started concurrently, the face is said to allow tile-wise concurrent start. Throughout the paper, a face of an iteration space is referred to by its normal f. In this section, we provide conditions under which the tiling hyperplanes of any statement allow concurrent start along a given face of its iteration space. Theorem 1 introduces the constraints in terms of inter-tile dependences. Theorem 2 and Theorem 3 map Theorem 1 from inter-tile dependences onto tiling hyperplanes. We also prove that these constraints are both necessary and sufficient for concurrent start.

A. Constraints for concurrent start

Theorem 1: For a statement, a transformation enables tile-wise concurrent start along a face f iff the tile schedule is in the same direction as the face and carries all inter-tile dependences.

Let t_o be the outer tile schedule.
If f is the face allowing concurrent start and C is the matrix containing the approximate inter-tile dependences of the original iteration space, then

  k_1 t_o = k_2 f,  k_1, k_2 ∈ Z+
  t_o · C ≥ 1

Hence,

  f · C ≥ 1

Fig. 8. Tile schedule and the normal to the face should be in the same direction.

In Figure 8, the face allowing concurrent start and the outer tile schedule are the same, and they carry both tile dependences. Therefore, the tiles t_1, t_2, t_3, t_4 can be started concurrently.

Theorem 2: Concurrent start along a face f can be exposed by a set of hyperplanes iff f lies strictly inside the cone formed by the hyperplanes, i.e., iff f is a strict conic combination of all the hyperplanes:

  k f = λ_1 h_1 + λ_2 h_2 + ... + λ_n h_n,  λ_i, k ∈ Z+   (6)

Proof (sufficient): As mentioned earlier, every inter-tile dependence is satisfied by only one hyperplane and has a zero component along every other hyperplane. Consider the following expression:

  λ_1 (h_1·c) + λ_2 (h_2·c) + ... + λ_n (h_n·c)   (7)

For any tile dependence c, as we constrain all the λs to be strictly positive, (7) will always be positive. Thus, by choosing all h such that

  k f = λ_1 h_1 + λ_2 h_2 + ... + λ_n h_n

we ensure f·c ≥ 1 for all inter-tile dependences. Therefore, from Theorem 1, concurrent start is enabled. □

Proof (necessary): Let us assume that we have concurrent start along the face f, but f does not lie strictly inside the cone formed by the hyperplanes, i.e., (6) does not hold. Without loss of generality, we can assume that k ∈ Z+, but there exist no λs which are all strictly positive integers. For at least one inter-tile dependence c, the sum (7) ≤ 0, because every tile dependence is carried by only one of the chosen hyperplanes and there exists at least one λ such that λ ≤ 0. Therefore, f·c will also be zero or negative, which means concurrent start is inhibited along f. This is a contradiction. □

For the face f that allows concurrent start, let f′ be its counterpart in the transformed space. We now provide an alternative result for concurrent start that is equivalent to the one above.

Theorem 3: A transformation T allows concurrent start along f iff f · T_R^{-1} ≥ 1.

Proof: From Theorem 1, for the transformed space the condition for concurrent start becomes

  f′ · C′ ≥ 1   (8)

We know that in the transformed space, we approximate all inter-tile dependences by unit vectors along every base. Since the normals are not affected by translations, the normal f′ after transformation T is given by f · T_R^{-1}. Also, C′ is a unit matrix. Therefore, (8) becomes

  f · T_R^{-1} ≥ 1
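Theorems 1 and 2 can be sanity-checked on the running example. A small sketch; the inter-tile dependence sets are the ones derived in Section II-E, and the conic multipliers are worked out by hand:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def concurrent_start_ok(f, inter_tile_deps):
    """Theorem 1: the face normal f must carry (have dot product >= 1
    with) every inter-tile dependence."""
    return all(dot(f, c) >= 1 for c in inter_tile_deps)

# Face of the iteration space along which we want concurrent start.
f = (1, 0)
```

With hyperplanes (1, 0) and (2, 1) the inter-tile dependence (0, 1) is not carried by f, so concurrent start fails; with (2, −1) and (2, 1) both inter-tile dependences are carried, and 4f = (2, −1) + (2, 1) exhibits the strict conic combination required by Theorem 2 (k = 4, λ_1 = λ_2 = 1).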


Fig. 6. Pipelined start-up.
Fig. 7. Concurrent start-up.

The above can be viewed in a different manner. We know that the inter-tile dependences introduced in the original iteration space can be approximated as

  C = T_R^{-1}

From Theorem 1 we have,

  f · T_R^{-1} ≥ 1  □

B. Partial concurrent start

By making the outer tile schedule parallel to the face allowing concurrent start (as discussed in the previous section), one can obtain n−1 degrees of concurrent start, i.e., all tiles on the face can be started. But, in practice, exploiting all of these degrees may result in more complex code. The above conditions can be placed on only the first few hyperplanes to obtain what we term partial concurrent start. For instance, the constraints can be placed on only the first two hyperplanes so that one degree of concurrent start is exploited. Such transformations may not only generate better code owing to prefetching and vectorization, but also achieve coarser-grained parallelization.

C. The case of multiple statements

In the case of multiple statements, every strongly connected component in the dependence graph can have only one outer tile schedule. So, all statements which have to be started concurrently and together should have the same face that allows concurrent start. If they do not have the same face allowing concurrent start, one of those has to be chosen for the outer tile schedule.
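Theorem 3 gives a purely matrix-level test. A sketch for the 2×2 case, using exact rational arithmetic; since f · T_R^{-1} ≥ 1 holds only after suitable positive integer scaling, the check below tests strict positivity of every component, which is equivalent up to that scaling:

```python
from fractions import Fraction

def allows_concurrent_start(f, T_R):
    """Theorem 3 (up to positive scaling): every component of
    f . T_R^{-1} must be strictly positive."""
    (a, b), (c, d) = T_R
    det = Fraction(a * d - b * c)
    inv = [[ d / det, -b / det],
           [-c / det,  a / det]]
    row = [f[0] * inv[0][j] + f[1] * inv[1][j] for j in range(2)]
    return all(x > 0 for x in row)
```

The hyperplane choice (1, 0), (2, 1) fails the test (one component of f · T_R^{-1} is zero), while (2, −1), (2, 1) passes, consistent with the discussion in Section II-E.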
These cases are beyond the scope of this paper, since in the case of stencils, the following are always true:
• Every statement's iteration space has the same dimensionality and the same face f that allows concurrent start.
• Transitive self-dependences are all carried by this face.

For example, for the stencil in Figure 1 (an imperfect nest with 2 statements), the hyperplanes found by our scheme for S1 are (2, 1, 0) and (2, −1, 0) and those for S2 are (2, 1, 1) and (2, −1, 1), i.e., these correspond to 2t+i and 2t−i for S1, and 2t+i+1 and 2t−i+1 for S2.

IV. APPROACH FOR FINDING HYPERPLANES ITERATIVELY

We now present a scheme that extends the Pluto algorithm [5] so that hyperplanes satisfying the properties introduced in the previous section are found. Though Theorem 2 and Theorem 3 are equivalent to each other, we use Theorem 2 in our scheme as it provides simple linear inequalities and is easy to implement.

A. Iterative scheme

Along with the constraints imposed to respect dependences and encode an objective function, the following additional constraints are added while finding hyperplanes iteratively. For a given statement, let f be the face along which we would like to start concurrently, H be the set of hyperplanes already found, and n be the dimensionality of the iteration space (the same as the number of hyperplanes to find).
Then, we add the following additional constraints:
1) For the first n−1 hyperplanes, h is linearly independent of f ∪ H, as opposed to just H.
2) The last hyperplane h_n lies strictly inside the cone formed by f and the negatives of the already found n−1 hyperplanes, i.e.,

  k h_n = λ′ f + λ_1 (−h_1) + λ_2 (−h_2) + ... + λ_{n−1} (−h_{n−1}),  λ_i, k ∈ Z+   (9)

In Algorithm 1, we show only our additions to the iterative algorithm proposed in [5].

Algorithm 1 Finding tiling hyperplanes that allow concurrent start
1: Initialize H = ∅
2: for n−1 times do
3:   Build constraints to preserve dependences.
4:   Add constraints such that the hyperplane to be found is linearly independent of f ∪ H.
5:   Add cost function constraints and minimize the cost function for the optimal solution.
6:   H = h ∪ H
7: end for
8: Add constraint (9) for the last hyperplane so that it lies strictly inside the cone of the face and the negatives of the n−1 hyperplanes already found (H).

If it is not possible to find hyperplanes with these additional constraints, we report that tile-wise concurrent start is not possible. If there exists a set of hyperplanes that exposes tile-wise concurrent start, we now prove that the above algorithm will find it.

Proof (soundness): When the algorithm returns a set of hyperplanes, we can be sure that all of them are independent of each other and that they also satisfy Equation (9), which is obtained by simply rearranging the terms of Equation (6). Trivially, our scheme of finding the hyperplanes is sound, i.e., whenever our scheme gives a set of hyperplanes as output, the output is always correct. □

Proof (completeness): We prove by contradiction that whenever our scheme reports no solution, there does not exist any valid set of hyperplanes. Suppose there exists a valid set of hyperplanes but our scheme fails to find them. The scheme can fail at two points.

Case 1: While finding the first n−1 hyperplanes. The only constraint the first n−1 hyperplanes have is that of being linearly independent of one another, and of the face that allows concurrent start. If there exist n linearly independent valid tiling hyperplanes for this computation, then, since the face with concurrent start f is one feasible choice, there exist n−1 tiling hyperplanes linearly independent of f. Hence, our algorithm does not fail at this step if the original Pluto algorithm [5] is able to find n linearly independent ones.

Case 2: While finding the last hyperplane. Let us assume that our scheme fails while trying to find the last hyperplane h_n.
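For the 2-d running example, the effect of these constraints can be mimicked by a brute-force search over small integer hyperplane coefficients. This is a toy sketch only, not the ILP formulation actually used by the Pluto-based implementation; the coefficient bounds are arbitrary assumptions:

```python
from itertools import product

DEPS = [(1, -2), (1, 0), (1, 2)]  # dependences of the running example
F = (1, 0)                        # face allowing concurrent start

def valid(h):
    """Eq. (3) for uniform dependences."""
    return all(h[0] * d[0] + h[1] * d[1] >= 0 for d in DEPS)

def independent(h, g):
    """2-d linear independence via the determinant."""
    return h[0] * g[1] - h[1] * g[0] != 0

def find_hyperplanes(bound=3):
    # First hyperplane: valid and linearly independent of the face f.
    for h1 in product(range(-bound, bound + 1), repeat=2):
        if h1 != (0, 0) and valid(h1) and independent(h1, F):
            # Last hyperplane: valid and a strict conic combination
            # k*h_n = l0*f + l1*(-h1) with l0, l1 > 0 (Eq. 9).
            for l0, l1 in product(range(1, 4 * bound), repeat=2):
                v = (l0 * F[0] - l1 * h1[0], l0 * F[1] - l1 * h1[1])
                if v != (0, 0) and valid(v):
                    return h1, v
    return None
```

On the running example this search recovers (2, −1) and (2, 1), the pair shown earlier to enable concurrent start along (1, 0).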
This implies that for any hyperplane strictly inside the cone formed by the face allowing concurrent start and the negatives of the already found hyperplanes, there exists a dependence which makes the tiling hyperplane invalid, i.e., there exists a dependence distance d such that

  h_n · d ≤ −1

From Equation (9), we have

  k h_n·d = λ′ (f·d) + λ_1 (−h_1·d) + λ_2 (−h_2·d) + ... + λ_{n−1} (−h_{n−1}·d)   (10)

Consider the RHS of Equation (10). As all hyperplanes are valid, (h_i·d) ≥ 0, and since the λ_i are all positive, all terms except the first one are non-positive. For any stencil, the face allowing concurrent start always carries all dependences:

  f·d ≥ 1, ∀d

If the dependences are all constant distance vectors, all λ_i(−h_i·d) terms are independent of program parameters. So, we could always choose λ′ large enough and the λ_i small enough that h_n·d ≥ 1 for any dependence d.

In the general case of affine dependences, our algorithm can fail to find h_n in only two cases. The first is when it is not statically possible to choose the λ_i. This can happen when both of the following conditions are true:
• Dependences are non-uniform.
• Along the face allowing concurrent start, dependences have components that depend on program parameters.

In this case, only point-wise concurrent start is possible, but tile-wise concurrent start is not (Figure 9). In order for our algorithm to succeed, one of the λs would have to depend on a program parameter (typically the problem size), and this is not admissible.

The second case is the expected one that does not even allow point-wise concurrent start. Here, f does not carry all dependences (Figure 10). □

Fig. 9. No tile-wise concurrent start possible.
Fig. 10. Even point-wise concurrent start is not possible.

B. Example

We provide an example showing the hyperplanes found by our scheme for a 3-d stencil. For simplicity, we show the memory-inefficient version (with a data dimension for the time iterator) of the 2-d heat stencil in Figure 11.

for (t = 0; t < T; t++) {
  for (i = 1; i < N+1; i++) {
    for (j = 1; j < N+1; j++) {
      A[t+1][i][j] = 0.125 * (A[t][i+1][j] - 2.0*A[t][i][j] + A[t][i-1][j])
                   + 0.125 * (A[t][i][j+1] - 2.0*A[t][i][j] + A[t][i][j-1])
                   + A[t][i][j];
    }
  }
}

Fig. 11. 2d-heat stencil (representative version)

The transformation computed by Algorithm 1 for full concurrent start is:

  [ 1  1  0 ]
  [ 1  0  1 ]
  [ 1 −1 −1 ]

Figure 12 shows the shape of a tile formed by the above transformation. For partial concurrent start, the transformation computed by Algorithm 1 is:


  [ 1  1  0 ]
  [ 1 −1  0 ]
  [ 1  0  1 ]

This enables concurrent start in only one dimension, i.e., not all the tiles along the face allowing concurrent start, but all tiles along one edge of the face, can be started concurrently.

Fig. 12. Tile shape for the 2d-heat stencil obtained by our algorithm. The arrows represent hyperplanes.

V. EXPERIMENTAL EVALUATION

We implement our approach on top of the publicly available source-to-source polyhedral tool chain Pluto [26]. It uses the Cloog [9] library for code generation, and PIP [25] to solve for the coefficients of hyperplanes. PrimeTile [16] is used to perform unroll-jam on Pluto-generated code. We compare the performance of our system with Pluto, serving as the state of the art among the compiler works, and the Pochoir stencil compiler [34], representative of the state of the art among domain-specific works.

We use two hardware configurations, as shown in Table I. All benchmarks use double-precision floating-point computation. icc is used to compile all codes with the options "-O3 -fp-model precise"; hence, only value-safe optimizations are performed. The optimal tile sizes and unroll factors are determined empirically with a limited amount of search. The problem sizes for the various benchmarks are shown in Table II.

TABLE I. DETAILS OF ARCHITECTURES USED FOR EXPERIMENTS

                    Intel Xeon E5645        AMD Opteron 6136
Microarchitecture   Westmere-EP             Magny-Cours
Clock               2.4 GHz                 2.4 GHz
Cores / socket      6                       8
Total cores         12                      16
L1 cache / core     32 KB                   128 KB
L2 cache / core     512 KB                  512 KB
L3 cache / socket   12 MB                   12 MB
Peak GFLOPs         57.6                    153.6
Compiler            icc 12.1.3              icc 12.1.3
Compiler flags      -O3 -fp-model precise   -O3 -fp-model precise
Linux kernel        2.6.32                  2.6.35

A. Benchmarks

TABLE II. PROBLEM SIZES FOR THE BENCHMARKS

Benchmark      Problem Size
1d-heat        1600000 × 1000
2d-heat        16000² × 1000
3d-heat        150³ × 100
game-of-life   16000 × 500
apop           160³ × 100
3d7pt          160³ × 100

• Heat 1D/2D/3D: Heat equations are examples of symmetric stencils. We evaluate the performance of discretized 1D, 2D and 3D heat equation stencils with non-periodic boundary conditions. Heat 1D is a 3-point stencil, while heat 2D and heat 3D are 5-point and 7-point stencils, respectively.
• Game of Life: Conway's Game of Life [13] is an 8-point stencil where the state of each point in the next time iteration depends on its 8 neighbors. We consider a particular version of the game called B2S23, where a point is "born" if it has exactly two live neighbors and a point survives the current stage if it has exactly two or three live neighbors.
• 3d7pt: An order-1 3D 7-point stencil [11] from the Berkeley auto-tuner framework.
• APOP: APOP [18] is a one-dimensional 3-point stencil that calculates the price of an American put stock option.

B. Significance of concurrent start

In this subsection, we demonstrate the importance of our enhancement over existing schemes quantitatively. Though pipeline drain and start-up can be minimized by tweaking tile sizes (besides making sure there are enough tiles in the wavefront), this could constrain tile sizes, leading to loss of locality and/or a higher frequency of synchronization. In addition, all of this is highly problem-size dependent.
It is true that for some problem sizes one can be fortunate enough to find the right tile sizes such that the effect of pipeline start-up/drain is very low, but the diamond tiling approach does not suffer from this constraint. Figure 13 demonstrates how the performance, load balance and scale-up provided by diamond tiling remain unaffected across problem sizes.

With the previous scheme, for a reasonable problem size, to make sure there are enough tiles in the wavefront and also to saturate the pipeline early, we might be forced to choose a tile size which is not the best. In our scheme, any tile size can be chosen.

Further, hyperplanes found by our scheme cannot be better than those found by Pluto with respect to the cost function. Therefore, the single-thread performance of code generated by our scheme may be lower. However, by using the hyperplanes found by our scheme only to demarcate tiles, and the hyperplanes that would be found by the Pluto algorithm without our enhancement to scan points inside a tile, we obtain the desired benefits for intra-tile execution.


Fig. 15. Heat 2D: (a) Intel, (b) AMD Opteron.
Fig. 13. Heat 2D running on the Intel machine for a problem size of 1600×1600×500, the best tile size for locality being 64×64×64.
Fig. 14. Heat 1D.

C. Results

Our scheme consistently outperforms Pluto for all benchmarks. We perform better than Pochoir on Heat 1D, Heat 2D, Heat 3D and 3d7pt, and perform comparably with Pochoir on APOP and Game of Life. The running times for the different benchmarks and the speedup factors we obtain over other schemes are presented in Table III. The speedup factors reported are for runs on 12 cores. "icc-par" is icc with the "-parallel" flag added to the compiler flags listed in Table I.

Heat 1D shows the effect of the concurrent start enabled by our tiling scheme. Figure 14 shows the load balance we achieve with increasing core count. Pluto-generated code performs at around 2 GFLOPs and does not scale beyond two cores. This is a result of both pipeline start-up/drain time and an insufficient number of tiles in the wavefront to keep all processors busy. Using our new tiling hyperplanes, one is able to distribute the iterations corresponding to the entire data space equally among all cores. The same maximal amount of work is done by the processors between two synchronizations, as the tile schedule is parallel to the face that allows concurrent start. We achieve 72.9% of the theoretical machine peak, compared to 66.2% for Pochoir.

For Heat 2D, we perform better than both Pochoir and Pluto (Figure 15(a)). The performance of Pluto-generated code saturates after 8 cores for the same reasons as mentioned for 1d-heat. Our scheme performs 12.7% better than Pochoir when using 12 cores on the Intel configuration. We also tested the three schemes on the AMD Opteron setup, and we see similar improvements with our scheme (Figure 15(b)).

On the Heat 3D benchmark, Pluto performs poorly, while both Pochoir and our scheme scale well (Figure 17). We use partial concurrent start in one dimension for our code for this example. Pochoir achieves about 21% of the machine peak and our scheme achieves 24.2%. CPU utilization is as expected with our technique, while Pluto suffers from load imbalance here. For our scheme, though the scaling seen with 1, 2, 4, 8 and 12 cores is almost ideal, it is flat between 5 and 8 cores. This is due to the number of tiles available not being a multiple of the number of threads. This load imbalance can easily be eliminated using dynamic scheduling techniques [3], and we plan to do so in future work. In addition, the tile-to-thread mapping we employed is not conscious of locality, i.e., inter-tile reuse could be exploited with another suitable mapping. Integrating the complementary techniques presented in [33] may improve performance further.

Fig. 16. Game of Life.
Fig. 18. American Put Option Pricing.
Fig. 17. Heat 3D.
Fig. 19. 3D 7-point stencil.

The Game of Life does not use any floating-point operations in the stencil. Performance is thus reported directly via running time in Figure 16. All schemes see a substantial decrease in running time when moving from a single core to two cores. Pluto-generated code does not provide much improvement beyond 8 cores.

Both our scheme and Pochoir exhibit a similar performance trend for APOP. Beyond seven cores, there is no increase in performance owing to the problem size.

For 3d7pt, we again outperform Pochoir and Pluto (Figure 19). The variations in performance with our code are due to the same reasons as those mentioned for 3d-heat.

VI. RELATED WORK

A significant amount of work has been done on optimizing stencil computations. These works fall into two categories. One body of work develops compiler optimizations [37], [29], [22], [7]; the other builds domain-specific stencil compilers or carries out domain-specific optimization studies [20], [12], [32], [33], [8], [31], [35], [34]. Among the compiler techniques, the polyhedral compiler works have been implemented in some production compilers [15], [30], [6], and are also available publicly as tools [26], [27]. They take the form of more general transformation frameworks and more or less subsume previous compiler work for stencils. The other significant body of work studies optimizations specific to stencils, and a number of these works are very recent. Many of these are publicly available as well, such as Pochoir [34]. Such systems have the opportunity to provide much higher performance than compiler-based ones owing to
These works fall into two categories. One body of work develops compiler optimizations [37], [29], [22], [7]; the other builds domain-specific stencil compilers or carries out domain-specific optimization studies [20], [12], [32], [33], [8], [31], [35], [34]. Among the compiler techniques, the polyhedral compiler works have been implemented in some production compilers [15], [30], [6], and are also available publicly as tools [26], [27]. They take the form of more general transformation frameworks and more or less subsume previous compiler works for stencils. The other significant body of work studies optimizations specific to stencils, and a number of these works are very recent. Many of these are publicly available as well, such as Pochoir [34]. Such systems have the opportunity to provide much higher performance than compiler-based ones owing to the greater amount of knowledge they have about the computation.

TABLE III
SUMMARY OF PERFORMANCE

                      Performance (1 core)                    Performance (12 cores)                 Speedup over
Benchmark      Original(icc) pochoir  pluto-cur  Our      icc-par  pochoir  pluto-cur  Our       icc-par  pochoir  pluto-cur
1d-heat        5.73s         1.94s    3.01s      1.77s    2.61s    171.7ms  1.58s      155.8ms   16.75    1.10     10.14
2d-heat        7.65m         4.93m    6.76m      3.42m    8.44m    28.43s   65.31s     25.22s    20.08    1.13     2.59
2d-heat-amd    13.23m        8.71m    9.99m      5.01m    12.30m   50.52s   84.57s     48.11s    15.34    1.05     1.75
3d-heat        1.61s         849.2ms  1.27s      871ms    1.60s    156.6ms  446ms      135ms     11.85    1.16     3.3
game-of-life   13.50m        9.86m    14.15m     10.60m   13.50m   52.51s   2.98m      64.09s    12.64    0.82     2.79
apop           81.72ms       49.98ms  76.43ms    73.43ms  46.70ms  12.19ms  63.40ms    11.70ms   3.99     1.04     5.42
3d7pt          1.61s         1.05s    1.68s      919.8ms  1.6s     181.9ms  995.3ms    143.2ms   11.23    1.27     6.95

Pochoir [34] uses a cache-oblivious tiling mechanism with tiles shaped like trapezoids. Such a tiling mechanism does not suffer from load imbalance. Our scheme, on the other hand, is dependence-driven and is oblivious to the input code being a stencil. Our experimental evaluation included a comparison with Pochoir. In addition, though our techniques are implemented in a polyhedral source-to-source compiler, they apply to either body of work.

Strzodka et al. [33] present a technique, cache accurate time skewing (CATS), for tiling stencils, and report significant improvement over Pluto and other simpler manual optimization strategies for 2-d and 3-d domains. Diamond-shaped tiles, which are found automatically by our technique, are also used by CATS in a particular way. However, their scheme also pays attention to additional orthogonal aspects such as the mapping of tiles to threads and is presumably far more cache-friendly than our scheme.
We plan to explore those complementary aspects and integrate them into our technique in the future. While ours is an end-to-end automatic compiler framework, CATS is more of a customized optimization system.

Efficient vectorization of stencils is challenging. Henretty et al. [17] present a data layout transformation technique to improve vectorization. However, the reported improvement was limited for architectures that provide nearly the same performance for unaligned loads as for aligned loads; this is the case with the Nehalem, Westmere, and Sandy Bridge architectures from Intel. We thus did not explore layout transformations for vectorization, but instead relied on the Intel compiler's auto-vectorization after automatic application of enabling transformations within a tile.

Krishnamoorthy et al. [22] addressed concurrent start when tiling stencil computations. However, their approach worked by starting with a valid tiling and correcting it to allow concurrent start. This was done via overlapped execution of tiles (overlapped tiling) or by splitting a tile into sub-tiles (split tiling). With such an approach, one misses natural ways of tiling that inherently do not suffer from pipelined start-up. We have shown that, in all those cases, there exist valid tiling hyperplanes that allow concurrent start. The key problem with both overlapped tiling and split tiling is the difficulty of performing code generation automatically; no implementation of these techniques currently exists. In addition, our conditions for concurrent start work with arbitrary affine dependences, while those in [22] were presented for uniform dependences, i.e., dependences that can be represented by constant distance vectors.

VII.
CONCLUSIONS

We have designed and evaluated new techniques for tiling stencil computations. These techniques, in addition to the usual benefits of tiling, provide concurrent start-up whenever possible. We showed that tile-wise concurrent start is enabled if the face of the iteration space that allows point-wise concurrent start strictly lies in the cone formed by all tiling hyperplanes. We then provided an approach to find such hyperplanes. The presented techniques are automatic and have been implemented in a source-level parallelizer, Pluto. Experimental evaluation on a 12-core Intel multicore shows that our code outperforms a tuned domain-specific stencil code generator in some cases by about 4% to 27%. In other cases, we outperform previous compiler techniques by a factor of 2x to 10.14x on 12 cores over a set of benchmarks. Our implementation will be available in a future release of Pluto.

REFERENCES

[1] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986.
[2] C. Ancourt and F. Irigoin. Scanning polyhedra with DO loops. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 39-50, 1991.
[3] M. Baskaran, N. Vydyanathan, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors. In ACM SIGPLAN PPoPP, pages 219-228, 2009.
[4] C. Bastoul. Clan: The Chunky Loop Analyzer. Clan user guide.
[5] U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In ETAPS CC, Apr. 2008.
[6] U. Bondhugula, O. Gunluk, S. Dash, and L. Renganarayanan. A model for fusion and code motion in an automatic parallelizing compiler. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pages 343-352. ACM, 2010.
[7] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral program optimization system. In ACM SIGPLAN PLDI, June 2008.
[8] M. Christen, O. Schenk, and H. Burkhart. PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 676-687, May 2011.
[9] CLooG: The Chunky Loop Generator. http://www.cloog.org.


[10] A. Cohen, S. Girbal, D. Parello, M. Sigler, O. Temam, and N. Vasilache. Facilitating the search for compositions of program transformations. In ACM International Conference on Supercomputing, pages 151-160, June 2005.
[11] K. Datta. Auto-tuning Stencil Codes for Cache-Based Multicore Platforms. PhD thesis, EECS Department, University of California, Berkeley, Dec. 2009.
[12] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. A. Patterson, J. Shalf, and K. A. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC, page 4, 2008.
[13] M. Gardner. Mathematical Games. Scientific American, 1970.
[14] M. Griebl, P. Feautrier, and A. Größlinger. Forward communication only placements and their use for parallel program construction. In LCPC, pages 16-30, 2005.
[15] T. Grosser, H. Zheng, R. Aloor, A. Simbürger, A. Größlinger, and L.-N. Pouchet. Polly - polyhedral optimization in LLVM. In Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), 2011.
[16] A. Hartono, M. Manik, A. Baskaran, C. Bastoul, A. Cohen, S. Krishnamoorthy, B. Norris, and J. Ramanujam. PrimeTile: A parametric multi-level tiler for imperfect loop nests, 2009.
[17] T. Henretty, K. Stock, L.-N. Pouchet, F. Franchetti, J. Ramanujam, and P. Sadayappan. Data layout transformation for stencil computations on short SIMD architectures. In ETAPS International Conference on Compiler Construction (CC'11), pages 225-245, Saarbrücken, Germany, Mar. 2011.
[18] J. Hull. Options, Futures, and Other Derivatives. Pearson Prentice Hall, Upper Saddle River, NJ, 6th edition, 2006.
[19] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '88, pages 319-329, 1988.
[20] S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Implicit and explicit optimization for stencil computations. In ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, 2006.
[21] W. Kelly and W. Pugh. A unifying framework for iteration reordering transformations. Technical Report CS-TR-3430, Dept. of Computer Science, University of Maryland, College Park, 1995.
[22] S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In ACM SIGPLAN Symposium on Programming Languages Design and Implementation, July 2007.
[23] A. Lim, G. I. Cheong, and M. S. Lam. An affine partitioning algorithm to maximize parallelism and minimize communication. In ACM ICS, pages 228-237, 1999.
[24] A. Lim and M. S. Lam. Maximizing parallelism and minimizing synchronization with affine partitions. Parallel Computing, 24(3-4):445-475, 1998.
[25] PIP: The Parametric Integer Programming Library. http://www.piplib.org.
[26] PLUTO: A polyhedral automatic parallelizer and locality optimizer for multicores. http://pluto-compiler.sourceforge.net.
[27] PoCC: Polyhedral compiler collection. http://pocc.sourceforge.net.
[28] J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for multicomputers. Journal of Parallel and Distributed Computing, 16(2):108-230, 1992.
[29] L. Renganarayanan, M. Harthikote-Matha, R. Dewri, and S. V. Rajopadhye. Towards optimal multi-level tiling for stencil computations. In IPDPS, pages 1-10, 2007.
[30] R-Stream - High Level Compiler, Reservoir Labs. http://www.reservoir.com.
[31] N. Sedaghati, R. Thomas, L.-N. Pouchet, R. Teodorescu, and P. Sadayappan. StVEC: A vector instruction extension for high performance stencil computation. In 20th International Conference on Parallel Architectures and Compilation Techniques (PACT'11), Galveston Island, Texas, Oct. 2011.
[32] R. Strzodka, M. Shaheen, D. Pajak, and H.-P. Seidel. Cache oblivious parallelograms in iterative stencil computations. In ICS, pages 49-59, 2010.
[33] R. Strzodka, M. Shaheen, D. Pajak, and H.-P. Seidel. Cache accurate time skewing in iterative stencil computations. In ICPP, pages 571-581, 2011.
[34] Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The Pochoir stencil compiler. In SPAA, pages 117-128, 2011.
[35] J. Treibig, G. Wellein, and G. Hager. Efficient multicore-aware parallelization strategies for iterative stencil computations. Journal of Computational Science, 2(2):130-137, 2011.
[36] M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.
[37] D. Wonnacott. Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In Proceedings of the 14th International Parallel and Distributed Processing Symposium (IPDPS), pages 171-180, 2000.
[38] J. Xue. Loop Tiling for Parallelism. Kluwer Academic Publishers, Norwell, MA, USA, 2000.
