More Iteration Space Tiling

Michael Wolfe
Oregon Graduate Center

Abstract

Subdividing the iteration space of a loop into blocks or tiles with a fixed maximum size has several advantages. Tiles become a natural candidate as the unit of work for parallel task scheduling. Synchronization between processors can be done between tiles, reducing synchronization frequency (at some loss of potential parallelism). The shape and size of a tile can be optimized to take advantage of memory locality for memory hierarchy utilization. Vectorization and register locality fit naturally into the optimization within a tile, while parallelization and cache locality fit into the optimization between tiles.

Keywords: parallelization, memory hierarchy optimization, data dependence

1. Introduction

Advanced compilers are capable of many program restructuring transformations, such as vectorization, concurrency detection and loop interchanging. In addition to utilization of multiple vector processors, program performance on large systems now depends on effective use of the memory hierarchy. The memory hierarchy may comprise a virtual address space residing in a large (long latency) shared main memory, with either a shared cache memory serving all the processors or with a private cache memory serving each processor. In addition, vector processors typically have vector registers, which should be used effectively.

It has long been known that program restructuring can dramatically reduce the load on a memory hierarchy subsystem [AbuS78, AbKL81, Wolf87]. A recent paper [IrTr88] describes a procedure to partition the iteration space of a tightly-nested loop into supernodes, where each supernode comprises a set of iterations that will be scheduled as an atomic task on a processor. That procedure works from a new data dependence abstraction, called the dependence cone. The dependence cone is used to find legal partitions and to find dependence constraints between supernodes.

This paper recasts and extends that work by using several elementary program restructuring transformations to optimize programs with the same goals. Our procedures are not restricted to tightly-nested loops and do not depend on a new dependence abstraction, although they will benefit from precise information.

The next section describes and reviews basic concepts such as the iteration space of a loop, data dependence and parallel loop execution. The three most popular data dependence abstractions are explained, along with a new abstraction inspired by [IrTr88]. Section 3 introduces the elementary loop restructuring transformations used in this paper: loop interchanging, loop skewing, strip mining and tiling. Section 4 defines the footprint of an array reference with respect to a loop; this is the basis for optimization for memory locality. Section 5 enumerates the optimization goals for a compiler, while Section 6 describes the optimization process and includes several examples. The final section has some concluding remarks.

2. Iteration Space and Data Dependence

We say that nested iterative loops (such as Fortran DO or Pascal FOR loops) define an iteration space, comprising a finite discrete Cartesian space with dimensionality equal to the loop nest level. For example, the two loops in Program 1 traverse the two-dimensional iteration space shown in Figure 1. The semantics of a serial loop define how the iteration space is traversed (from left to right, top to bottom in our figures).

Program 1:
    do I = 1, 5
      do J = 1, 10
        A(I,J) = B(I,J) + C(I)*D(J)
      enddo
    enddo

Figure 1. (The rectangular iteration space of Program 1.)

There is no reason that the iteration space need be rectangular; many popular algorithms have inner loops whose limits depend on the values of outer loop indices. The iteration space for Program 2 is triangular, as shown in Figure 2, suggesting the name triangular loop.

Program 2:
    do I = 1, 5
      do J = I, 5
        A(I,J) = B(I,J) + C(I)*D(J)
      enddo
    enddo

Figure 2. (The triangular iteration space of Program 2.)

Other interesting iteration space shapes can be defined by nested loops, such as trapezoids, rhomboids, and so on; we will show how some of these shapes can be generated from each other via loop restructuring.
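As a small illustration (a sketch, not one of the paper's numbered programs), a rhomboid (parallelogram) iteration space arises when both limits of the inner loop are shifted by the outer loop index; the bounds below are chosen only for the example:

    do I = 1, 5
      do J = I+1, I+5
        A(I,J) = B(I,J) + C(I)*D(J)
      enddo
    enddo

Each value of I executes five iterations of J, but the strip of J values slides to the right as I increases, giving a parallelogram rather than a rectangle or triangle.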


Data Dependence: Many compilers available today for vector and parallel computers advertise the ability to detect vector or parallel operations from serial loops. Parallelism is detected by discovering the essential data flow (or data dependence) in the loop and allowing vector or parallel execution when data dependence relations are not violated. Loop restructuring transformations, such as loop interchanging, are often applied to enhance the available parallelism or otherwise optimize performance; data dependence information is needed to test whether restructuring transformations are legal (the program produces the same answer after restructuring as it did before).

There are three essential kinds of data dependence, though in this paper they will be treated identically. A flow-dependence relation occurs when the value assigned to a variable or array element in the execution of one instance of a statement is used (read, fetched) by the subsequent execution of an instance of the same or another statement. Program 3 has a flow dependence relation from statement S1 to itself, since the value assigned to A(I+1) will be used on the next iteration of the loop. We usually write this S1 δ S1.

Program 3:
    do I = 1, N-1
      S1: A(I+1) = A(I) + B(I)
    enddo

An anti-dependence relation occurs when the value read from a variable or array element in an instance of some statement is reassigned by a subsequent instance of some statement. In Program 4 there is an anti-dependence relation from S1 to S2, since B(I,J) is used in S1 and subsequently reassigned by S2 in the same iteration of the loop. We usually write this S1 δ̄ S2.

Program 4:
    do I = 1, N
      do J = 1, M
        S1: A(I,J) = B(I,J) + 1
        S2: B(I,J) = C(I,J) - 1
      enddo
    enddo

Finally, an output dependence relation occurs when some variable or array element is assigned in an instance of a statement and reassigned by a subsequent instance of some statement. An example of this is Program 5, where there is a potential output dependence relation from S2 to S1, since the variable B(I+1) assigned in S2 may be reassigned in the next iteration of the loop by S1. We usually write this S2 δ° S1. This also shows that the data dependence relations in a program must be approximated; since a compiler will not know the actual paths taken in the program, it must make conservative assumptions.

Program 5:
    do I = 1, N-1
      S1: if (A(I) > 0) B(I) = C(I)/A(I)
      S2: B(I+1) = C(I) / 2
    enddo

In order to apply a wide variety of loop transformations, data dependence relations are annotated with information showing how they are related to the enclosing loops. Three such annotations are popular today. Many dependence relations have a constant distance in each dimension of the iteration space. When this is the case, a distance vector can be built where each element is a constant integer representing the dependence distance in the corresponding loop. For example, in Program 6 there is a data dependence relation in the iteration space as shown in Figure 3; each iteration (i, j) depends on the value computed in iteration (i, j-1). We say that the distances for this dependence relation are zero in the I loop and one in the J loop, and we write this S1 δ(0,1) S1.

Program 6:
    do I = 1, N
      do J = 2, M
        S1: A(I,J) = A(I,J-1) + B(I,J)
      enddo
    enddo

Figure 3. (Data dependence relations in the iteration space of Program 6.)

For many transformations, the actual distance in each loop may not be so important as just the sign of the distance in each loop; also, often the distance is not constant in the loop, even though it may always be positive (or negative).

Program 7:
    do I = 1, N
      do J = 1, N
        S1: X(I+1,2*J) = X(I,J) + B(I)
      enddo
    enddo

As an example, in Program 7 the assignment to X(I+1,2*J) is used in some subsequent iteration of the I and J loops by the X(I,J) reference. Let S1[I=1,J=1] refer to the instance of statement S1 executing for the iteration when the loop variables I=1 and J=1. Then S1[I=1,J=1] assigns X(2,2), which is used by S1[I=2,J=2], for a dependence distance of (1,1); however, S1[I=2,J=2] assigns X(3,4), which is used by S1[I=3,J=4], for a dependence distance of (1,2). The distance for this dependence in the J loop is always positive, but is not a constant. A common method to represent this is to save a vector of the signs of the dependence distances, usually called a direction vector. Each direction vector element will be one of {+, 0, -} [Bane88]; for historical reasons, these are usually written (<, =, >) [WoBa87, Wolf89]. In Program 7, we can associate the direction vector with the dependence relation by writing S1 δ(<,<) S1, where in Program 6 the dependence relation would be written S1 δ(=,<) S1.

Another (often more precise) annotation is inspired by [IrTr88]. Instead of saving only the sign of the distance (which loses a great deal of information about any actual distances), save a set of distance vectors from which all potential actual dependence distances can be formed by linear combination. In Program 7, for example, we would save the distance vectors +(1,1) and (0,1), since all actual dependence distances are linear combinations of these distances; the + in +(1,1) means that the linear combination must include a non-zero coefficient for this distance vector, while the other coefficients must be non-negative. (The actual distances in Program 7 are (1,1), (1,2), (1,3), ..., and each can be written as 1*(1,1) + k*(0,1) with k >= 0.)

Another popular data dependence annotation saves only the nest level of the outermost loop with a positive distance [AlKe87]. The dependence relation for Program 6 has a zero distance in the outer loop, but a positive distance in the inner loop, so we would write S1 δ2 S1. We also say that this dependence relation is carried by the inner J loop. Some dependence relations may not be carried by any loop, as in Program 8.

Program 8:
    do I = 1, N
      do J = 2, M
        S1: A(I,J) = B(I,J) + C(I,J)
        S2: D(I,J) = A(I,J) + 1
      enddo
    enddo

Here the references to A(I,J) produce a dependence relation from S1 to S2 with zero distance in both loops. We would thus say S1 δ(0,0) S2 or S1 δ(=,=) S2. Since it is carried by neither of the loops, we call it a loop-independent dependence, represented S1 δ∞ S2.

2.1. Parallel Loop Execution

We represent a parallel loop with a doall statement. We assume that a parallel loop can be executed on a multiprocessor by scheduling the iterations of the loop on different processors in any order, with no synchronization between the iterations. The iteration space of a doall is the same as that of a sequential loop except that the traversal of a parallel loop is unordered.
If we replace the outer loop of Program 1 by a doall, as in Program 9, the iteration space traversal would be as in Figure 4, with a single fork, followed by 5 parallel unordered iterations of I (each of which executes the 10 iterations of J), followed by a single join.

Program 9:
    doall I = 1, 5
      do J = 1, 10
        A(I,J) = B(I,J) + C(I)*D(J)
      enddo
    enddo

Figure 4. (Traversal of the iteration space of Program 9: a single fork, 5 unordered iterations of I, a single join.)

If we instead replace the inner loop by a doall, as in Program 10, the iteration space traversal would be as in Figure 5, where each iteration of the I loop would contain a fork, followed by 10 parallel unordered iterations of J, followed by a join. It is obvious that in most cases, the best performance will result when the number of forks and joins is minimized, which occurs when doalls are at the outermost nest level.


Program 10:
    do I = 1, 5
      doall J = 1, 10
        A(I,J) = B(I,J) + C(I)*D(J)
      enddo
    enddo

Figure 5. (Traversal of the iteration space of Program 10: a fork and join for each iteration of I.)

It is also obvious that a sequential loop may be converted into a doall when it carries no dependence relations. For example, in Program 11 there is a flow-dependence relation S1 δ(1,2) S1 due to the assignment and subsequent use of A. Even though the distance in the J loop dimension is non-zero, the J loop may be executed in parallel since the only dependence relation is carried by the outer I loop. The outer I loop can be executed in parallel only by the insertion of synchronization primitives.

Program 11:
    do I = 2, N
      do J = 3, M
        S1: A(I,J) = A(I-1,J-2) + C(I)*D(J)
      enddo
    enddo

3. Restructuring Transformations

The most powerful compilers and translators are capable of advanced program restructuring transformations to optimize performance on high speed parallel computers. Automatic conversion of sequential code to parallel code is one example of program restructuring. Associated with each restructuring transformation is a data dependence test which must be satisfied by each dependence relation in order to apply that transformation. As we have already seen, converting a sequential loop to a parallel doall requires that the loop carries no dependence. This parallelization has no effect on the data dependence graph, though we will see that other transformations do change data dependence relations somewhat.

Loop Interchanging: One of the most important restructuring transformations is loop interchanging. Interchanging two loops can be used with several different goals in mind. As shown above, the outer loop of Program 11 cannot be converted to a parallel doall without additional synchronization. However, the two loops can be interchanged, producing Program 12. Loop interchanging is legal if there are no dependence relations that are carried by the outer loop and have a negative distance in the inner loop (i.e., no (<,>) direction vectors [AlKe84, WoBa87]). The distance or direction vector for the data dependence relation in the interchanged loop has the corresponding elements interchanged, giving the dependence relation S1 δ(2,1) S1. Since the outermost loop with a positive distance is now the outer J loop, the J loop carries this dependence; the I loop carries no dependences and can be executed in parallel. Loop interchanging thus enables parallel execution of other loops; this may be desirable if, for instance, it is known that M is very small (so parallel execution of the J loop would give little speedup) or if parallel access to the second dimension of A would produce memory conflicts.

Program 12:
    do J = 3, M
      do I = 2, N
        S1: A(I,J) = A(I-1,J-2) + C(I)*D(J)
      enddo
    enddo

Loop Skewing: Some nested loops have dependence relations carried by each loop, preventing parallel execution of any of the loops. An example of this is the relaxation algorithm shown in Program 13a. The data dependence relations in the iteration space of this loop are shown in Figure 6; the four dependence relations have distance vectors:

    S1 δ(1,0) S1        S1 δ(0,1) S1
    S1 δ̄(1,0) S1        S1 δ̄(0,1) S1

One way to extract parallelism from this loop is via the wavefront (or hyperplane) method [Mura71, Lamp74]. We show how to implement the wavefront method via loop skewing and loop interchanging [Wolf86].

Program 13a:
    do I = 2, N-1
      do J = 2, M-1
        S1: A(I,J) = 0.2*(A(I-1,J)+A(I,J-1)+A(I,J)+A(I+1,J)+A(I,J+1))
      enddo
    enddo

Figure 6. (Data dependence relations in the iteration space of Program 13a.)


Loop skewing changes the shape of the iteration space from a rectangle to a parallelogram. We can skew the J loop of Program 13a with respect to the I loop by adding I to the upper and lower limits of the J loop; this requires that we then subtract I from J within the loop. The skewed loop is shown in Program 13b and the skewed iteration space is shown in Figure 7.

Program 13b:
    do I = 2, N-1
      do J = I+2, I+M-1
        S1: A(I,J-I) = 0.2*(A(I-1,J-I)+A(I,J-I-1)+A(I,J-I)+A(I+1,J-I)+A(I,J-I+1))
      enddo
    enddo

Figure 7. (The skewed iteration space of Program 13b.)

The distance vectors for the data dependence relations in the skewed loop change from (d1, d2) to (d1, d1+d2), so the modified dependence relations are:

    S1 δ(1,1) S1        S1 δ(0,1) S1
    S1 δ̄(1,1) S1        S1 δ̄(0,1) S1

Interchanging the skewed loops requires some clever modifications to the loop limits, as shown in Program 13c.

Program 13c:
    do J = 4, N+M-2
      do I = MAX(2,J-M+1), MIN(N-1,J-2)
        S1: A(I,J-I) = 0.2*(A(I-1,J-I)+A(I,J-I-1)+A(I,J-I)+A(I+1,J-I)+A(I,J-I+1))
      enddo
    enddo

As before, interchanging the two loops requires that we switch the corresponding elements in the distance vectors, giving:

    S1 δ(1,1) S1        S1 δ(1,0) S1
    S1 δ̄(1,1) S1        S1 δ̄(1,0) S1

Notice that in each case, the distance vector has a positive value in the first element, meaning that each dependence relation is carried by the outer loop (the J loop); thus, the skewed and interchanged I loop can be executed in parallel, which gives us the wavefront formulation.

Strip Mining: Vectorizing compilers often divide a single loop into a pair of loops, where the maximum trip count of the inner loop is equal to the maximum vector length of the machine. Thus, for a Cray vector computer, the loop in Program 14a will essentially be converted into the pair of loops in Program 14b. This process is called strip mining [Love77]. The original loop is divided into strips of some maximum size, the strip size; in Program 14b, the inner loop (or element loop) has a strip size of 64, which is the length of the Cray vector registers. The outer loop (the IS loop, for "strip loop") steps between the strips; on the Cray, the I loop corresponds to the vector instructions.

Program 14a:
    do I = 1, N
      S1: A(I) = A(I) + B(I)
      S2: C(I) = A(I-1) * 2
    enddo

Program 14b:
    do IS = 1, N, 64
      do I = IS, MIN(N,IS+63)
        S1: A(I) = A(I) + B(I)
        S2: C(I) = A(I-1) * 2
      enddo
    enddo

Strip mining is always legal; however, it does have an effect on the data dependence relations in the loop. As strip mining adds a loop, it adds a dimension to the iteration space; thus it must also add an element to the distance or direction vector. When a loop is strip mined, a dependence relation with a distance (d) in the distance vector for that loop produces one or two dependence relations. If d is a multiple of the strip size ss, then a distance vector (d) is changed to the distance vector (d/ss, 0). If d is not a multiple of ss, then a distance vector (d) generates two dependence relations, with distance vectors

    ( floor(d/ss), d mod ss )   and   ( ceiling(d/ss), (d mod ss) - ss ).

(For example, with d = 5 and ss = 4, these are (1,1) and (2,-3).) In either case, if the original dependence distance is larger than (or equal to) the strip size, then after strip mining the strip loop will carry that dependence relation, allowing parallel execution of the element loop.

Iteration Space Tiling: When nested loops are strip mined and the strip loops are all interchanged to the outermost level, the result is a tiling of the iteration space. The double-nested loop in Program 15a can be tiled to become the four-nested loop in Program 15b.
This corresponds to dividing the two-dimensional iteration space for Program 15a into "tiles", as shown in Figure 8. Each tile corresponds to the inner two element loops, and the outer two "tile" loops step between the tiles.

Program 15a:
    do I = 1, N
      do J = 1, N
        S1: A(I,J) = A(I,J) + B(I,J)
        S2: C(I,J) = A(I-1,J) * 2
      enddo
    enddo
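The four-nested tiled form can be sketched as follows; the symbolic tile sizes ITS and JTS are an assumption here, and the sketch simply follows the pattern the paper uses later in Program 19d (strip-mine each loop, then move the strip loops outward to become the tile loops):

    do IT = 1, N, ITS
      do JT = 1, N, JTS
        do I = IT, MIN(N, IT+ITS-1)
          do J = JT, MIN(N, JT+JTS-1)
            S1: A(I,J) = A(I,J) + B(I,J)
            S2: C(I,J) = A(I-1,J) * 2
          enddo
        enddo
      enddo
    enddo

Each (IT,JT) pair identifies one tile; the inner I and J element loops traverse the iterations belonging to that tile.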


If the data items in a footprint can be held in a faster level of the memory hierarchy (cache, registers, local memory), then they need to be loaded only once (either automatically, as in a hardware-managed cache, or by additional software, as for registers).

Program 18a:
    do I = 1, N
      do J = 1, M
        A(I,J) = B(J) * C(I)
      enddo
    enddo

Program 18b:
    do IT = 1, N, 32
      do JT = 1, M, 32
        do I = IT, MIN(N, IT+31)
          do J = JT, MIN(M, JT+31)
            A(I,J) = B(J) * C(I)
          enddo
        enddo
      enddo
    enddo

We note here that there may be other problems with finding and optimizing for footprints. First, given a cache memory environment, a cache line may be more than one word wide. On the Sequent Symmetry, for example, a cache line is 2 words wide; thus, when a word is loaded into the cache memory, one of its neighbors is dragged along also, whether or not it is wanted. If a footprint comprised (say) 32 consecutive words, then at most 2 unneeded words would be dragged into the cache; if however the footprint comprised a row in an array stored by columns, then each word would drag another word into the cache. This could potentially double the amount of cache memory used for this footprint; wider cache lines exacerbate the problem. This (or other considerations) may induce a preferred ordering when processing tiles.

Second, for software-managed memory hierarchies, we need not only to optimize the footprint size, but also to be able to identify it. Usually this is no problem, as it will consist of a starting position, a stride and a length.

5. Optimization Goals

Given our toolkit of restructuring transformations, we wish to optimize nested loops for execution on multiprocessors with a memory hierarchy, where each processor may also have vector instructions. We tile the iteration space such that each tile will be a unit of work to be executed on a processor. Communication between processors will not be allowed during execution of a tile. Tiles will be optimized to provide locality and vector execution. The scheduling of tiles onto processors will be done to provide either locality across parallel tiles or not, depending on the memory hierarchy organization.

Atomic Tiles: Each tile is a unit of work to be scheduled on a processor. Once a tile is scheduled on a processor, it runs to completion without preemption. A tile will not be initiated until all dependence constraints for that tile are satisfied, so there will never be a reason that a tile, once started, should have to relinquish the processor.

Parallelism between Tiles: The tiles should be arranged in the iteration space to allow for as much parallelism between tiles as possible. If there is dependence in one dimension and not another, then the tile size may be adjusted so that each tile has a small size in the independent dimension to allow for more independent tiles along that dimension. Depending on how parallelism is implemented, the tile loops may need to be reordered and/or skewed to implement synchronization between tiles.

Vectors within Tiles: If the processors have vector instructions, then the innermost loop should be vectorized. This corresponds to ordering the element loops so that the innermost element loop is vector.
This goal may be somewhat inconsistent with the next goal.

Locality within Tiles: The size of the tiles will be adjusted so as to provide good usage of the memory hierarchy. When no data reuse occurs, the ordering of the loops within a tile will not matter (there is no locality anyway); when data reuse does occur, the ordering of the loops will be optimized to take advantage of locality at least in the first and second levels.

Locality between Tiles: In the best case, all the data for a single tile will fit into the highest level of the memory hierarchy (cache, perhaps), allowing the optimizer to look for reuse between tiles. When adjacent tiles in the iteration space share much or all of the data, then the optimizer should try to schedule those tiles on the same processor. If multiple processors share a cache, then parallel tiles which share much of the same data should be scheduled onto those processors at the same time to take advantage of the shared cache. If multiple processors do not share a cache, then parallel tiles scheduled at the same time should be those which do not share data, to prevent memory interference.

6. Optimization Process

The tiling optimization process consists of several distinct steps, described below:

1) The iteration space may be reshaped, through loop skewing. This will give differently shaped tiles in the next step.

2) The iteration space is tiled. Tiling essentially consists of strip-mining each loop and interchanging the strip loops outwards to become the tile loops, though there are some slight complexities that should be handled properly for triangular loop limits. The tile size in each dimension is set in the next two steps.

3) The element loops are reordered and optimized. We can optimize for locality by reordering until the inner loops have the smallest total footprint. We may also optimize for vector instructions or memory strides in the inner loop. The iteration space of the tile may be reshaped via loop skewing and loop interchanging in this step also. Some limits on tile sizes may be set in this step to provide for locality within certain levels of the memory hierarchy (such as vector registers).


4) The tile loops are reordered and optimized. Again, this may involve reshaping the tile iteration space via loop skewing and interchanging. The optimization at this level will depend on the model of parallelism used by the system, and the dependence constraints between tiles. The method described in [IrTr88] has one outermost serial loop surrounding several inner parallel tile loops, using loop skewing (wavefronting) in the tile iteration space to satisfy any dependence relations. We also wish to take advantage of locality between tiles by giving each processor either a rule for which tile to execute next or at least a preference for which direction in the tile iteration space to process tiles to best take advantage of locality. The sizes of the tiles are also set at this time.

Let us show some simple examples to illustrate the optimization process.

Example 1: Given a simple nested sequential loop, such as Program 19a, let us see how tiling would optimize the loop for multiple vector processors with private caches. For a simple vector computer, we would be tempted to interchange and vectorize the I loop, because it gives a chained multiply-add vector operation and all the memory references are stride-1 (with Fortran column-major storage; otherwise the J loop would be used); this is shown in Program 19b. However, if the column size (N) was larger than the cache size, each pass through the K loop would have to reload the whole column of A into the cache.

Program 19a:
    do I = 1, N
      do J = 1, M
        A(I,J) = 0.0
        do K = 1, L
          A(I,J) = A(I,J) + B(I,K)*C(K,J)
        enddo
      enddo
    enddo

Program 19b:
    do J = 1, M
      A(1:N,J) = 0.0
      do K = 1, L
        A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J)
      enddo
    enddo

For a simple multiprocessor, we might be tempted to interchange the J loop outwards and parallelize it, as in Program 19c, so that each processor would operate on distinct columns of A and C. Each pass through the K loop would again have to reload the cache with a row of B and a column of C if L is too large.

Program 19c:
    doall J = 1, M
      do I = 1, N
        A(I,J) = 0.0
        do K = 1, L
          A(I,J) = A(I,J) + B(I,K)*C(K,J)
        enddo
      enddo
    enddo

Instead, let us attempt to tile the entire iteration space. We will use symbolic names for the tile size in each dimension, since determining the tile sizes will be done later. Tiling the iteration space can proceed even though the loops are not perfectly nested. Essentially, each loop is strip-mined, then the strip loops are interchanged outwards to become the tile loops. The tiled program is shown in Program 19d.

Program 19d:
    do IT = 1, N, ITS
      do JT = 1, M, JTS
        do I = IT, MIN(N,IT+ITS-1)
          do J = JT, MIN(M,JT+JTS-1)
            A(I,J) = 0.0
          enddo
        enddo
        do KT = 1, L, KTS
          do I = IT, MIN(N,IT+ITS-1)
            do J = JT, MIN(M,JT+JTS-1)
              do K = KT, MIN(L,KT+KTS-1)
                A(I,J) = A(I,J) + B(I,K)*C(K,J)
              enddo
            enddo
          enddo
        enddo
      enddo
    enddo

Each set of element loops is ordered to provide the kind of local performance the machine needs. The first set of element loops has no locality (the footprint of A is ITS x JTS, the same size as the iteration space), so we need only optimize for vector operations and perhaps memory stride; we do this by vectorizing the I loop.

    IVL = MIN(N,IT+ITS-1)
    do J = JT, MIN(M,JT+JTS-1)
      A(IT:IVL,J) = 0.0
    enddo

The second set of element loops can be reordered 6 ways; the JKI ordering gives stride-1 vector operations in the inner loop, and one level of locality for A in the second inner loop (the footprint of A in the K loop is ITS while the iteration space is ITS x KTS).
Furthermore, if ITS is the size of a vector register, the footprint of A fits into a vector register during that loop, meaning that the vector register load and store of A can be floated out of the K loop entirely.


    IVL = MIN(N,IT+ITS-1)
    do J = JT, MIN(M,JT+JTS-1)
      do K = KT, MIN(L,KT+KTS-1)
        A(IT:IVL,J) = A(IT:IVL,J) + B(IT:IVL,K)*C(K,J)
      enddo
    enddo

Since there are no dependence constraints between tiles along the IT and JT dimensions, those two loops can be executed in parallel. The method suggested in [IrTr88] will 'wavefront' the tile iteration space by having one sequential outermost loop surrounding parallel doalls; thus, the final program would be as in Program 19e. Note that the tile loops had to be distributed (their formulation only dealt with tightly-nested loops); also, the nested doalls inside the KT loop will generate a fork/join operation for each iteration of the KT loop. If processors are randomly assigned to iterations of the doalls (and thus to tiles), the system will not be able to take advantage of locality between tiles.

Program 19e:
    doall IT = 1, N, ITS
      doall JT = 1, M, JTS
        IVL = MIN(N,IT+ITS-1)
        do J = JT, MIN(M,JT+JTS-1)
          A(IT:IVL,J) = 0.0
        enddo
      enddo
    enddo
    do KT = 1, L, KTS
      doall IT = 1, N, ITS
        doall JT = 1, M, JTS
          IVL = MIN(N,IT+ITS-1)
          do J = JT, MIN(M,JT+JTS-1)
            do K = KT, MIN(L,KT+KTS-1)
              A(IT:IVL,J) = A(IT:IVL,J) + B(IT:IVL,K)*C(K,J)
            enddo
          enddo
        enddo
      enddo
    enddo

Another obvious method is to leave the parallel doalls outermost, as in Program 19f. This generates a single fork/join operation, but the size of the parallel task is much larger, meaning there is less opportunity for load balancing. However, a single parallel task now comprises all of the tiles along the KT dimension. Each iteration of the KT loop uses the same footprint of A, so scheduling all iterations on the same processor takes advantage of that locality between tiles.

Program 19f:
    doall IT = 1, N, ITS
      doall JT = 1, M, JTS
        IVL = MIN(N,IT+ITS-1)
        do J = JT, MIN(M,JT+JTS-1)
          A(IT:IVL,J) = 0.0
        enddo
        do KT = 1, L, KTS
          IVL = MIN(N,IT+ITS-1)
          do J = JT, MIN(M,JT+JTS-1)
            do K = KT, MIN(L,KT+KTS-1)
              A(IT:IVL,J) = A(IT:IVL,J) + B(IT:IVL,K)*C(K,J)
            enddo
          enddo
        enddo
      enddo
    enddo

Example 2: The example used in [IrTr88] is a five-point difference equation, as was shown in Program 13. We will show how our methods can derive the two partitionings shown in their paper.

The first partition (Figure 2 of [IrTr88]) starts by skewing the iteration space before tiling, as in Figure 10a; each tile is executed with vector instructions along the I dimension. To satisfy data dependence relations between vertically adjacent tiles, the tile iteration space is then skewed again, as in Figure 10b; in this figure, vertically aligned tiles can be executed concurrently on different processors. This could be implemented by a wavefront method (sequential loop surrounding a doall), or by assigning tiles along the J dimension to the same processor and synchronizing between tiles along the I dimension.

Figure 10. (a) The skewed, then tiled, iteration space; (b) the tile iteration space skewed again.

The second partition (Figure 6 of [IrTr88]) tiles the iteration space first, as in Figure 11a, then skews each tile to get vector operations, as in Figure 11b. Finally, the tile iteration space is skewed to satisfy dependences between vertically adjacent tiles, resulting in Figure 11c; again, processors can be assigned to rows with time flowing to the right.

Figure 11. (a) The tiled iteration space; (b) each tile skewed for vector operations; (c) the tile iteration space skewed to satisfy dependences.
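As an illustration of the wavefront implementation mentioned above (a sequential loop surrounding a doall over tiles), the following sketch assumes an NTI x NTJ grid of tiles in which each tile depends only on its left and upper neighbors; the names NTI, NTJ, TD, TI and TJ are illustrative and do not appear in the paper:

    do TD = 0, NTI+NTJ-2
      doall TI = MAX(0, TD-NTJ+1), MIN(TD, NTI-1)
        TJ = TD - TI
!       execute tile (TI,TJ) here: its element loops cover the
!       iterations assigned to that tile
      enddo
    enddo

All tiles on a given anti-diagonal TD are mutually independent under these dependence constraints, so each doall executes them in parallel; the serial TD loop enforces the dependences between successive diagonals.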


7. Summary

We have described several elementary program restructuring transformations which can be combined with parallelism detection to optimize programs for execution on parallel processors with memory hierarchies. These techniques are similar to those described in [IrTr88], but are more general and simpler to apply in the setting of a compiler or other program restructuring tool. Before any of these techniques are implemented in a compiler we need to understand the complexity of the optimization process. Given the data dependence information, it is simple to discover whether and how a loop can be tiled. The difficulty is trying to find the optimal loop ordering. This can be an O(d!) problem, where d is the loop depth, since we may have to consider each possible loop ordering. This is then complicated by the possibility of skewing the iteration space before tiling or skewing each tile individually. The procedure used here does have the advantage of decoupling the optimization of the code within a tile from the optimization between tiles, reducing the complexity from O((2d)!) to just O(d!). For loops that are not very deeply nested, the actual computation at each step in the optimization process is relatively small (computation of the footprints and dependences between iterations), so an exhaustive search of the loop orderings may be reasonable.

References

[AbuS78] Walid Abdul-Karim Abu-Sufah, Improving the Performance of Virtual Memory Computers, Ph.D. Thesis, Dept. of Comp. Sci. Rpt. No. 78-945, Univ. of Illinois, Urbana, IL, Nov. 1978; available as document 79-15307 from University Microfilms, Ann Arbor, MI.

[AbKL81] W. A. Abu-Sufah, D. J. Kuck and D. H. Lawrie, "On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations," IEEE Trans. on Computers, Vol. C-30, No. 5, pp. 341-356, May 1981.

[AlKe84] John R. Allen and Ken Kennedy, "Automatic Loop Interchange," Proc. of the ACM SIGPLAN '84 Symposium on Compiler Construction, Montreal, Canada, June 17-22, 1984; SIGPLAN Notices, Vol. 19, No. 6, pp. 233-246, June 1984.

[AlKe87] John R. Allen and Ken Kennedy, "Automatic Translation of Fortran Programs to Vector Form," ACM Transactions on Programming Languages and Systems, Vol. 9, No. 4, pp. 491-542, October 1987.

[Bane88] Utpal Banerjee, Dependence Analysis for Supercomputing, Kluwer Academic Publishers, Norwell, MA, 1988.

[IrTr88] F. Irigoin and R. Triolet, "Supernode Partitioning," Conf. Record of the 15th Annual ACM Symp. on Principles of Programming Languages, San Diego, CA, Jan. 13-15, 1988, ACM Press, New York, pp. 319-329.

[Lamp74] Leslie Lamport, "The Parallel Execution of DO Loops," Comm. of the ACM, Vol. 17, No. 2, pp. 83-93, Feb. 1974.

[Love77] D. Loveman, "Program Improvement by Source-to-Source Transformation," J. of the ACM, Vol. 20, No. 1, pp. 121-145, Jan. 1977.

[Mura71] Yoichi Muraoka, Parallelism Exposure and Exploitation in Programs, Ph.D. Thesis, Dept. of Comp. Sci. Rpt. No. 71-424, Univ. of Illinois, Urbana, IL, Feb. 1971.

[Wolf86] Michael Wolfe, "Loop Skewing: The Wavefront Method Revisited," Int'l Journal of Parallel Programming, Vol. 15, No. 4, pp. 279-294, Aug. 1986.

[WoBa87] Michael Wolfe and Utpal Banerjee, "Data Dependence and Its Application to Parallel Processing," Int'l Journal of Parallel Programming, Vol. 16, No. 2, pp. 137-177, April 1987.

[Wolf87] Michael Wolfe, "Iteration Space Tiling for Memory Hierarchies," Proc. of the 3rd SIAM Conf. on Parallel Processing for Scientific Computing, Garry Rodrigue (ed.), Society for Industrial and Applied Mathematics, Philadelphia, PA, pp. 357-361, 1987.

[Wolf89] Michael Wolfe, Optimizing Supercompilers for Supercomputers, MIT Press, Boston, MA, 1989.
