More Iteration Space Tiling

4) The tile loops are reordered and optimized. Again, this may involve reshaping the tile iteration space via loop skewing and interchanging. The optimization at this level will depend on the model of parallelism used by the system, and the dependence constraints between tiles. The method described in [IrTr88] has one outermost serial loop surrounding several inner parallel tile loops, using loop skewing (wavefronting) in the tile iteration space to satisfy any dependence relations. We also wish to take advantage of locality between tiles by giving each processor either a rule for which tile to execute next or at least a preference for which direction in the tile iteration space to process tiles to best take advantage of locality. The sizes of the tiles are also set at this time.

Let us show some simple examples to illustrate the optimization process.

Example 1: Given a simple nested sequential loop, such as Program 19a, let us see how tiling would optimize the loop for multiple vector processors with private caches. For a simple vector computer, we would be tempted to interchange and vectorize the I loop, because it gives a chained multiply-add vector operation and all the memory references are stride-1 (with Fortran column-major storage; otherwise the J loop would be used); this is shown in Program 19b. However, if the column size (N) is larger than the cache size, each pass through the K loop would have to reload the whole column of A into the cache.

Program 19a:
    do I = 1, N
      do J = 1, M
        A(I,J) = 0.0
        do K = 1, L
          A(I,J) = A(I,J) + B(I,K)*C(K,J)
        enddo
      enddo
    enddo

Program 19b:
    do J = 1, M
      A(1:N,J) = 0.0
      do K = 1, L
        A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J)
      enddo
    enddo

For a simple multiprocessor, we might be tempted to interchange the J loop outwards and parallelize it, as in Program 19c, so that each processor would operate on distinct columns of A and C.
Each pass through the K loop would again have to reload the cache with a row of B and column of C if L is too large.

Program 19c:
    doall J = 1, M
      do I = 1, N
        A(I,J) = 0.0
        do K = 1, L
          A(I,J) = A(I,J) + B(I,K)*C(K,J)
        enddo
      enddo
    enddo

Instead, let us attempt to tile the entire iteration space. We will use symbolic names for the tile size in each dimension, since determining the tile sizes will be done later. Tiling the iteration space can proceed even though the loops are not perfectly nested. Essentially, each loop is strip-mined, then the strip loops are interchanged outwards to become the tile loops. The tiled program is shown in Program 19d.

Program 19d:
    do IT = 1, N, ITS
      do JT = 1, M, JTS
        do I = IT, MIN(N,IT+ITS-1)
          do J = JT, MIN(M,JT+JTS-1)
            A(I,J) = 0.0
          enddo
        enddo
        do KT = 1, L, KTS
          do I = IT, MIN(N,IT+ITS-1)
            do J = JT, MIN(M,JT+JTS-1)
              do K = KT, MIN(L,KT+KTS-1)
                A(I,J) = A(I,J) + B(I,K)*C(K,J)
              enddo
            enddo
          enddo
        enddo
      enddo
    enddo

Each set of element loops is ordered to provide the kind of local performance the machine needs. The first set of element loops has no locality (the footprint of A is ITS×JTS, the same size as the iteration space), so we need only optimize for vector operations and perhaps memory stride; we do this by vectorizing the I loop.

    IVL = MIN(N,IT+ITS-1)
    do J = JT, MIN(M,JT+JTS-1)
      A(IT:IVL,J) = 0.0
    enddo

The second set of element loops can be reordered 6 ways; the JKI ordering gives stride-1 vector operations in the inner loop, and one level of locality for A in the second inner loop (the footprint of A in the K loop is ITS while the iteration space is ITS×KTS). Furthermore, if ITS is the size of a vector register, the footprint of A fits into a vector register during that loop, meaning that the vector register load and store of A can be floated out of the K loop entirely.
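
As a check on the tiled form, the following is a minimal free-form Fortran sketch that runs the untiled loop of Program 19a next to a tiled loop in the style of Program 19d, with the second set of element loops in the JKI order and the I loop written as an array section. The program name, the array names a_ref and a_tiled, the problem sizes, the tile sizes, and the test data are all assumptions made for illustration; none of them come from the text. Each element loop runs from its tile's origin to the tile's upper bound, clipped by MIN at the edge of the iteration space.

```fortran
! Sketch only: untiled vs. tiled matrix multiply, after Programs 19a and 19d.
! All sizes and tile sizes are arbitrary illustrative values (assumptions).
program tiling_sketch
  implicit none
  integer, parameter :: n = 37, m = 29, l = 41       ! problem size (assumed)
  integer, parameter :: its = 8, jts = 4, kts = 16   ! tile sizes (assumed)
  real :: a_ref(n,m), a_tiled(n,m), b(n,l), c(l,m)
  integer :: i, j, k, it, jt, kt, ihi

  ! Small integer-valued data, so both summation orders are exact.
  do k = 1, l
    do i = 1, n
      b(i,k) = real(mod(i + k, 5))
    end do
  end do
  do j = 1, m
    do k = 1, l
      c(k,j) = real(mod(k + 2*j, 7))
    end do
  end do

  ! Untiled reference, in the loop order of Program 19a.
  do i = 1, n
    do j = 1, m
      a_ref(i,j) = 0.0
      do k = 1, l
        a_ref(i,j) = a_ref(i,j) + b(i,k)*c(k,j)
      end do
    end do
  end do

  ! Tiled version: it, jt, kt are the tile loops.
  do it = 1, n, its
    ihi = min(n, it + its - 1)            ! last row of this tile
    do jt = 1, m, jts
      ! First set of element loops: zero one tile of A, vectorized over I.
      do j = jt, min(m, jt + jts - 1)
        a_tiled(it:ihi, j) = 0.0
      end do
      do kt = 1, l, kts
        ! Second set, JKI order: the I loop is a stride-1 array section.
        do j = jt, min(m, jt + jts - 1)
          do k = kt, min(l, kt + kts - 1)
            a_tiled(it:ihi, j) = a_tiled(it:ihi, j) + b(it:ihi, k)*c(k,j)
          end do
        end do
      end do
    end do
  end do

  ! The two results should agree; the difference is expected to be zero.
  print *, 'max abs difference:', maxval(abs(a_tiled - a_ref))
end program tiling_sketch
```

The sizes are deliberately not multiples of the tile sizes, so the MIN bounds on the partial tiles at the edges are exercised. With a_tiled(it:ihi, j) unchanged across the K loop, a vectorizing compiler is free to hold that section in a register for the whole loop, which is the hoisting opportunity described above; whether a particular compiler does so is not something this sketch demonstrates.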
