29.11.2012 Views

Compile-time Loop Splitting for Distributed Memory ... - Stanford AI Lab



Figure 2-5: The data footprint of a processor (a) for the expression $A[i][j] = A[i+1][j+1] + A[i+1][j+2]$ and (b) for the more general expression in which $d_i^+$ and $d_j^+$ are the largest positive offsets for the induction variables $i$ and $j$, respectively; and $d_i^-$ and $d_j^-$ are the smallest negative offsets. The white area represents data for which a value is calculated, while the shaded areas are the additional data needed for the calculations.

a partitioning groups closely associated iterations on one processor, thereby increasing the temporal locality by maximizing data reuse. When an iteration needs a particular array cell, the cell is cached and available to later iterations on the same processor. Because a network or memory access occurs only once per unique array cell, and because the suggested tile dimensions minimize the number of different array references, such a task partition minimizes the total access time and is optimal.
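The effect of tile shape on the number of distinct array cells can be illustrated with a small sketch. It assumes the expression of Figure 2-5(a) reads A[i][j] = A[i+1][j+1] + A[i+1][j+2] (an assumed reading of the caption), and it counts the cells that a tile reads but does not itself compute. The names `footprint`, `extra_cells`, and `OFFSETS` are hypothetical helpers introduced only for this illustration.

```python
# Read offsets (di, dj) for the assumed expression
# A[i][j] = A[i+1][j+1] + A[i+1][j+2].
OFFSETS = [(1, 1), (1, 2)]


def footprint(I, J, read_offsets):
    """Return (written, read) sets of array cells for an I x J tile at the origin."""
    # Cells for which this tile calculates a value (the white area).
    written = {(i, j) for i in range(I) for j in range(J)}
    # Every distinct cell referenced on the right-hand side.
    read = {(i + di, j + dj) for (i, j) in written for (di, dj) in read_offsets}
    return written, read


def extra_cells(I, J, read_offsets=OFFSETS):
    """Count cells that are read by the tile but not calculated by it (the shaded area)."""
    written, read = footprint(I, J, read_offsets)
    return len(read - written)


if __name__ == "__main__":
    # Three tiles with the same 16 iterations but different shapes.
    for I, J in [(4, 4), (1, 16), (16, 1)]:
        print(I, J, extra_cells(I, J))
```

Under these assumptions all three tiles perform 16 iterations, yet the square tile needs the fewest additional cells, which is the sense in which tile dimensions affect communication.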

The details of optimal task partitioning are contained in [AKN92], but the determination of the optimal aspect ratio for a 2-D loop nest is presented briefly here.

The derivation of the optimal (to a first approximation) aspect ratio is rather simple. We find the I and J that result in minimal communication and then compute their ratio I/J. This is performed in the following manner.

The tile size is $k = I \times J$. Communication (to a first approximation) is the number of rows and columns of nonlocal data. Where $r$ is the number of rows and $c$ is the number of columns of nonlocal data, the total communication in a multiprocessor with caches is

$$cI + rJ = cI + \frac{rk}{I} = \frac{ck}{J} + rJ.$$

(The symbols $k$, $r$, and $c$ are assumed notation here; the original symbols did not survive extraction.)
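The minimization can be checked numerically with a minimal sketch. It assumes a cost model of the form c*I + r*J with the tile size I*J held fixed at k, where r and c stand for the numbers of nonlocal rows and columns; these symbol names and the functions `communication`, `best_tile`, and `continuous_optimum` are hypothetical and chosen only for this illustration. Setting the derivative of c*I + r*k/I with respect to I to zero gives I = sqrt(r*k/c), hence I/J = r/c under this model.

```python
import math


def communication(I, J, r, c):
    """Assumed first-order cost: c nonlocal columns of length I plus r nonlocal rows of length J."""
    return c * I + r * J


def best_tile(k, r, c):
    """Exhaustively search the integer tiles with I * J == k for the lowest cost."""
    best = None
    for I in range(1, k + 1):
        if k % I:
            continue  # I must divide the tile size exactly
        J = k // I
        cost = communication(I, J, r, c)
        if best is None or cost < best[2]:
            best = (I, J, cost)
    return best


def continuous_optimum(k, r, c):
    """Closed form from d/dI (c*I + r*k/I) = 0, ignoring integrality."""
    I = math.sqrt(r * k / c)
    J = k / I
    return I, J


if __name__ == "__main__":
    print(best_tile(64, 1, 1))           # equal offsets in both dimensions
    print(best_tile(64, 4, 1))           # four times as many nonlocal rows as columns
    print(continuous_optimum(64, 4, 1))
```

With equal row and column counts the search selects a square tile; when the row count is four times the column count, the selected tile has I/J = 4, in agreement with the closed form for this assumed model.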

To obtain the I and J that minimize communication, we calculate the derivative of<br />
