
More Iteration Space Tiling


        IVL = MIN(N,IT+ITS-1)
        do J = JT, MIN(M,JT+JTS-1)
          do K = KT, MIN(L,KT+KTS-1)
            A(IT:IVL,J) = A(IT:IVL,J) + B(IT:IVL,K)*C(K,J)
          enddo
        enddo

Since there are no dependence constraints between tiles along the IT and JT dimensions, those two loops can be executed in parallel. The method suggested in [IrTr88] will ‘wavefront’ the tile iteration space by having one sequential outermost loop surrounding parallel doalls; thus, the final program would be as in Program 19e. Note that the tile loops had to be distributed (their formulation only dealt with tightly nested loops); also, the nested doalls inside the KT loop will generate KTS fork/join operations. If processors are randomly assigned to iterations of the doalls (and thus to tiles), the system will not be able to take advantage of locality between tiles.

Program 19e:

    doall IT = 1, N, ITS
      doall JT = 1, M, JTS
        IVL = MIN(N,IT+ITS-1)
        do J = JT, MIN(M,JT+JTS-1)
          A(IT:IVL,J) = 0.0
        enddo
      enddo
    enddo
    do KT = 1, L, KTS
      doall IT = 1, N, ITS
        doall JT = 1, M, JTS
          IVL = MIN(N,IT+ITS-1)
          do J = JT, MIN(M,JT+JTS-1)
            do K = KT, MIN(L,KT+KTS-1)
              A(IT:IVL,J) = A(IT:IVL,J) + B(IT:IVL,K)*C(K,J)
            enddo
          enddo
        enddo
      enddo
    enddo

Another obvious method is to leave the parallel doalls outermost, as in Program 19f. This generates a single fork/join operation, but the size of the parallel task is much larger, meaning there is less opportunity for load balancing. However, a single parallel task now comprises all KTS tiles along the KT dimension.
Each iteration of the KT loop uses the same footprint of A, so scheduling all iterations on the same processor takes advantage of that locality between tiles.

Program 19f:

    doall IT = 1, N, ITS
      doall JT = 1, M, JTS
        IVL = MIN(N,IT+ITS-1)
        do J = JT, MIN(M,JT+JTS-1)
          A(IT:IVL,J) = 0.0
        enddo
        do KT = 1, L, KTS
          IVL = MIN(N,IT+ITS-1)
          do J = JT, MIN(M,JT+JTS-1)
            do K = KT, MIN(L,KT+KTS-1)
              A(IT:IVL,J) = A(IT:IVL,J) + B(IT:IVL,K)*C(K,J)
            enddo
          enddo
        enddo
      enddo
    enddo

Example 2: The example used in [IrTr88] is a five-point difference equation, as was shown in Program 13. We will show how our methods can derive the two partitionings shown in their paper.

The first partition (Figure 2 of [IrTr88]) starts by skewing the iteration space before tiling, as in Figure 10a; each tile is executed with vector instructions along the I dimension. To satisfy data dependence relations between vertically adjacent tiles, the tile iteration space is then skewed again, as in Figure 10b; in this figure, vertically aligned tiles can be executed concurrently on different processors. This could be implemented by a wavefront method (sequential loop surrounding a doall), or by assigning tiles along the J dimension to the same processor and synchronizing between tiles along the I dimension.

Figure 10. (a), (b)

The second partition (Figure 6 of [IrTr88]) tiles the iteration space first, as in Figure 11a, then skews each tile to get vector operations, as in Figure 11b. Finally, the tile iteration space is skewed to satisfy dependences between vertically adjacent tiles, resulting in Figure 11c; again, processors can be assigned to rows with time flowing to the right.
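The wavefront method mentioned for the first partition (a sequential loop surrounding a doall) can be illustrated with a small Python sketch. This is not code from the paper or from [IrTr88]. It assumes a hypothetical rectangular grid of tiles in which tile (i, j) depends only on tiles (i-1, j) and (i, j-1); the real dependence pattern depends on how the iteration space was skewed. The function names `wavefront_schedule` and `check_schedule` are invented for the example.

```python
def wavefront_schedule(n_tile_rows, n_tile_cols):
    """Group tile coordinates into wavefronts.

    Assumption (not from the paper): tile (i, j) depends only on tiles
    (i - 1, j) and (i, j - 1). Tiles with equal i + j are then mutually
    independent and could be run together as one doall.
    """
    fronts = []
    # Sequential outer loop over wavefront numbers w = i + j.
    for w in range(n_tile_rows + n_tile_cols - 1):
        # The tiles collected here play the role of the doall iterations.
        front = [(i, w - i)
                 for i in range(n_tile_rows)
                 if 0 <= w - i < n_tile_cols]
        fronts.append(front)
    return fronts


def check_schedule(fronts):
    """Return True if each tile runs strictly after its assumed predecessors."""
    step = {tile: w for w, front in enumerate(fronts) for tile in front}
    for (i, j), w in step.items():
        for pred in ((i - 1, j), (i, j - 1)):
            if pred in step and step[pred] >= w:
                return False
    return True


if __name__ == "__main__":
    fronts = wavefront_schedule(3, 4)
    for w, front in enumerate(fronts):
        print(w, front)
    print(check_schedule(fronts))
```

Each inner list is one parallel step, so this sketch needs one fork/join per wavefront. The alternative described in the text, pinning a row of tiles to one processor and synchronizing between neighbours, would trade those fork/joins for point-to-point synchronization.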
