Compile-time Loop Splitting for Distributed Memory ... - Stanford AI Lab
Compile-time Loop Splitting for Distributed Memory ... - Stanford AI Lab
Compile-time Loop Splitting for Distributed Memory ... - Stanford AI Lab
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
3<br />
4 1 5<br />
11 J 12<br />
13<br />
I<br />
19 20 21<br />
2<br />
A B<br />
Figure 2-5: The data footprint of a processor (a) <strong>for</strong> the expression e‘�“‘�“ ae‘�CI“‘�CI“C<br />
e‘�CI“‘�CP“ and (b) <strong>for</strong> the more general expression in which ��Cand ��Care the largest<br />
positive offsets <strong>for</strong> the induction variables � and �, respectively; and ��0 and ��0 are the<br />
smallest negative offsets. The white area represents data <strong>for</strong> which a value is calculated,<br />
while the shaded areas are the additional data needed <strong>for</strong> the calculations.<br />
a partitioning groups closely associated iterations on one processor, thereby increasing the<br />
temporal locality by maximizing data reuse. When an iteration needs a particular array<br />
cell, the cell is cached and available to later iterations on the same processor. Because a<br />
network or memory access occurs only once per unique array cell, and because the<br />
suggested tile dimensions minimize the number of different array references; such a task<br />
partition minimizes the total access <strong>time</strong> and is optimal.<br />
The details of optimal task partitioning are contained in [AKN92], but determining<br />
the optimal aspect ratio <strong>for</strong> a 2-D loop nest will quickly be presented here.<br />
The derivation of the optimal (to a first approximation) aspect ratio is rather simple.<br />
Finding the I and J resulting in minimal communication we compute their ratio I/J. This is<br />
per<strong>for</strong>med in the following manner.<br />
The tile size is � as 2 t. Communication (to a first approximation) is the number of<br />
rows and columns of nonlocal data. Where is the number of rows and is the number of<br />
columns, the total communication in a multiprocessor with caches is<br />
as Ct as C<br />
�<br />
s<br />
dj-<br />
di-<br />
a �<br />
t CtX<br />
To obtain the I and J that minimize communication, we calculate the derivative of<br />
20<br />
di+<br />
dj+