Compile-time Loop Splitting for Distributed Memory ... - Stanford AI Lab


Figure 2-5: The data footprint of a processor (a) for the expression $A[i][j] = A[i+1][j+1] + A[i+1][j+2]$ and (b) for the more general expression in which $d_i^+$ and $d_j^+$ are the largest positive offsets for the induction variables $i$ and $j$, respectively, and $d_i^-$ and $d_j^-$ are the smallest negative offsets. The white area represents data for which a value is calculated, while the shaded areas are the additional data needed for the calculations.

... a partitioning groups closely associated iterations on one processor, thereby increasing temporal locality by maximizing data reuse. When an iteration needs a particular array cell, the cell is cached and available to later iterations on the same processor. Because a network or memory access occurs only once per unique array cell, and because the suggested tile dimensions minimize the number of different array references, such a task partition minimizes the total access time and is optimal.

The details of optimal task partitioning are contained in [AKN92], but the optimal aspect ratio for a 2-D loop nest is derived briefly here.

The derivation of the optimal (to a first approximation) aspect ratio is simple: find the $I$ and $J$ that minimize communication, then compute their ratio $I/J$. The tile size is $k = I \times J$. Communication (to a first approximation) is the number of rows and columns of nonlocal data. Where $d_i$ is the number of rows and $d_j$ is the number of columns, the total communication in a multiprocessor with caches is

\[ c = d_j I + d_i J = d_j I + \frac{d_i k}{I} = \frac{d_j k}{J} + d_i J. \]

To obtain the $I$ and $J$ that minimize communication, we calculate the derivative of communication with respect to each variable and find where it has zero value:

\[ \frac{dc}{dI} = d_j - \frac{d_i k}{I^2} = 0 \;\Rightarrow\; I = \sqrt{\frac{d_i k}{d_j}} \tag{2.1} \]

\[ \frac{dc}{dJ} = d_i - \frac{d_j k}{J^2} = 0 \;\Rightarrow\; J = \sqrt{\frac{d_j k}{d_i}} \tag{2.2} \]

The optimal aspect ratio is the ratio of equation 2.1 to equation 2.2, which becomes

\[ \frac{I}{J} = \sqrt{\frac{d_i k}{d_j}} \Bigg/ \sqrt{\frac{d_j k}{d_i}} = \frac{d_i}{d_j}. \]

Thus, the optimal aspect ratio for Figure 2-5a is $I/J = 1/2$, and the ratio for Figure 2-5b is

\[ \frac{I}{J} = \frac{d_i^+ + d_i^-}{d_j^+ + d_j^-}. \]

2.3.2 Optimal Data Partitioning

When only one loop nest and one array exist, the optimal data partition is exactly the optimal task partition. Such a data partition turns most network accesses into local memory accesses by placing on each processor the cells whose values it is responsible for computing. In Figure 2-5a, the white portion would be in Processor 12's memory, and the shaded region would be the remote data needed from Processors 4, 5, and 13.

In general, however, the optimal data partition is harder to obtain. Alignment is the process of attempting to place the data footprint accessed by a task on the same processor as the task. Details of obtaining the optimal data partition parameters can be found in [AKN92].

2.4 The Problem

Most programs are not of the single loop nest, single array type. Instead, multiple loop nests with multiple arrays make alignment difficult. The resulting poor alignment fragments a processor's array references, causing accesses across several memory modules over the execution of the loop nest.
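As a numerical check on the aspect-ratio derivation of the optimal task partition, the following is a minimal Python sketch, not code from the thesis. It assumes the first-approximation cost model in which communication is $d_i$ nonlocal rows of length $J$ plus $d_j$ nonlocal columns of length $I$, treats the offsets as nonnegative magnitudes, and allows real-valued tile dimensions. The function names `communication` and `optimal_tile` and the tile size `k = 200` are hypothetical, chosen only for illustration.

```python
import math


def communication(I, J, d_i, d_j):
    """First-approximation communication for an I x J tile:
    d_i nonlocal rows of length J plus d_j nonlocal columns of length I."""
    return d_i * J + d_j * I


def optimal_tile(k, d_i, d_j):
    """Return real-valued (I, J) with I * J == k that minimizes communication().

    Obtained by setting the derivative of communication to zero
    (assumes d_i > 0 and d_j > 0)."""
    if d_i <= 0 or d_j <= 0:
        raise ValueError("d_i and d_j must be positive")
    I = math.sqrt(d_i * k / d_j)
    J = math.sqrt(d_j * k / d_i)
    return I, J


# Offsets for Figure 2-5a, A[i][j] = A[i+1][j+1] + A[i+1][j+2]:
# largest positive offsets are 1 (in i) and 2 (in j); no negative offsets.
d_i = 1 + 0
d_j = 2 + 0
k = 200  # illustrative tile size (assumption)

I, J = optimal_tile(k, d_i, d_j)
ratio = I / J  # expected to equal d_i / d_j

# Other tiles with the same area should communicate at least as much.
swapped = communication(20, 10, d_i, d_j)
best = communication(I, J, d_i, d_j)
```

Under these assumptions the computed ratio equals $d_i / d_j$, and swapping the tile dimensions while keeping the same area increases the modeled communication. A real partitioner would also need to round $I$ and $J$ to integers that divide the iteration space, which this sketch does not attempt.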
