for linear subscripts are completely disabled), because most practical irregular scientific applications contain loops of this kind.

Generally, in distributed-memory compilation, loop iterations are partitioned among processors according to the owner computes rule [1]. This rule specifies that, in a single-statement loop, each iteration is executed by the processor that owns the left-hand-side array reference of the assignment for that iteration.

However, the owner computes rule is often not well suited to irregular codes, because the use of indirection in accessing the left-hand-side array makes it difficult to partition the loop iterations according to this rule. Therefore, in the CHAOS library, Ponnusamy et al. [13, 14] proposed a heuristic method for irregular loop partitioning called the almost owner computes rule, in which an iteration is executed on the processor that owns the largest number of distributed array references in that iteration. Some HPF compilers implement this scheme through the EXECUTE-ON-HOME clause [15]. However, when we parallelized the fluid dynamics solver ZEUS-2D using the almost owner computes rule, we found that it does not minimize communication cost, measured either as the number of communication steps or as the number of elements communicated. Another drawback is that it is not straightforward to choose the owner when several processors own the same number of array references.

We propose a more efficient computes rule for irregular loop partitioning [8]. This approach assigns each iteration to a processor such that executing the iteration on that processor ensures that

• the number of communication steps is minimal, and
• the total amount of data to be communicated is minimal.

In our approach, neither the owner computes rule nor the almost owner computes rule is used to assign the loop iterations of an irregular computation. Instead, we propose a communication-cost-reducing computes rule, called the least communication computes rule. For a given irregular loop, we first examine every processor P_k, 0 ≤ k ≤ m−1 (m is the number of processors), and define two sets for each P_k: FanIn(P_k), the set of processors that must send data to P_k before the iteration is executed, and FanOut(P_k), the set of processors to which P_k must send data after the iteration is executed. Based on this information, each loop iteration is assigned to the processor on which executing it requires the least communication, and this is repeated until all iterations have been partitioned among the processors. Please refer to [8] for details.
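The full partitioning algorithm is given in [8]; the sketch below only illustrates, under assumed data structures, how FanIn/FanOut information for a single iteration might drive the choice of executing processor. The owner[] arrays, the tie-breaking on element counts, and all function names are assumptions of this illustration, not the paper's implementation.

```c
/* Sketch: choose the executing processor for one irregular loop iteration
 * under a "least communication" style rule.  owner[] maps each distributed
 * array reference of the iteration to the processor that owns it; names and
 * data layout here are illustrative assumptions, not the paper's code.      */
#include <stdio.h>

#define MAXP 64   /* maximum number of processors (assumption) */

/* Count the distinct processors in owner[] other than k, i.e. the size of
 * FanIn(P_k) for read references or FanOut(P_k) for write references.      */
static int fan_size(const int owner[], int nrefs, int k)
{
    int seen[MAXP] = {0}, count = 0;
    for (int r = 0; r < nrefs; r++) {
        int p = owner[r];
        if (p != k && !seen[p]) { seen[p] = 1; count++; }
    }
    return count;
}

/* Count elements that would have to move if P_k executed the iteration. */
static int fan_elems(const int owner[], int nrefs, int k)
{
    int n = 0;
    for (int r = 0; r < nrefs; r++)
        if (owner[r] != k) n++;
    return n;
}

/* Pick the processor that minimizes communication steps first
 * (|FanIn| + |FanOut|) and communicated elements second.          */
int assign_iteration(const int read_owner[], int nreads,
                     const int write_owner[], int nwrites, int m)
{
    int best = 0, best_steps = -1, best_elems = -1;
    for (int k = 0; k < m; k++) {
        int steps = fan_size(read_owner, nreads, k)     /* |FanIn(P_k)|  */
                  + fan_size(write_owner, nwrites, k);  /* |FanOut(P_k)| */
        int elems = fan_elems(read_owner, nreads, k)
                  + fan_elems(write_owner, nwrites, k);
        if (best_steps < 0 || steps < best_steps ||
            (steps == best_steps && elems < best_elems)) {
            best = k; best_steps = steps; best_elems = elems;
        }
    }
    return best;
}

int main(void)
{
    /* Iteration reads elements owned by P1 and P2 and writes one owned by P1. */
    int reads[]  = {1, 2};
    int writes[] = {1};
    printf("execute on P%d\n", assign_iteration(reads, 2, writes, 1, 4));
    return 0;
}
```

Ties are resolved here by taking the lowest processor id; any deterministic tie-break would do.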
3 Global-local Array Transformation and Index Array Remapping

There are two approaches to generating SPMD irregular code after loop partitioning. One is to receive the required data immediately before each iteration is executed, and to send the changed values (originally resident on other processors) back to those processors after the iteration is executed. The other is to gather all remote data needed by the iterations assigned to a processor before any of them execute, and to scatter all changed remote values back to the other processors after all iterations have executed. Because message aggregation is a principal means of communication optimization, the latter approach clearly gives better communication performance. To enable this aggregation, this section discusses the redistribution of indirection arrays.

3.1 Index Array Redistribution

After loop partitioning analysis, all iterations have been assigned to processors: iter(P_0) = {i_{0,1}, i_{0,2}, ..., i_{0,α0}}, ..., iter(P_{m−1}) = {i_{m−1,1}, i_{m−1,2}, ..., i_{m−1,αm−1}}. If an iteration i_r is assigned to processor P_k, the index array elements ix(i_r), iy(i_r), ... are not necessarily resident on P_k. Therefore, we need to redistribute all index arrays so that, for iter(P_k) = {i_{k,1}, i_{k,2}, ..., i_{k,αk}} and every index array ix, the elements ix(i_{k,1}), ..., ix(i_{k,αk}) are locally accessible.

As mentioned above, the source BLOCK partition scheme for processor P_k is src_iter(P_k) = {⌈g/m⌉·k + 1, ..., ⌈g/m⌉·(k+1)}, where g is the number of iterations of the loop. In a redistribution, the elements of an index array that must be communicated from P_k to P_k' are then given by src_iter(P_k) ∩ iter(P_k'). Returning to Example 1 and, as in Example 3, letting the size of the data array and index arrays be 12, loop partitioning analysis yields iter(P_0) = {1, 5, 8, 9, 10}, iter(P_1) = {2, 3, 4}, and iter(P_2) = {6, 7, 11, 12}. The index array elements to be redistributed are shown in Figure 2.
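As a concrete illustration of the subscript computation, the sketch below recomputes the sets src_iter(P_k) ∩ iter(P_k') for the block size and iteration assignment of the example above; the hard-coded iteration sets and all names are assumptions of this sketch, not the paper's runtime library.

```c
/* Sketch: which index-array elements each processor must send to which
 * other processor during redistribution.  src_iter(Pk) is the original
 * BLOCK range; iter(Pk') is the set produced by loop partitioning.
 * Names and hard-coded sets are illustrative assumptions.              */
#include <stdio.h>

#define M 3            /* number of processors      */
#define G 12           /* number of loop iterations */

/* iteration sets produced by the partitioning analysis (Example 1) */
static const int iter0[] = {1, 5, 8, 9, 10};
static const int iter1[] = {2, 3, 4};
static const int iter2[] = {6, 7, 11, 12};
static const int *iter[M] = {iter0, iter1, iter2};
static const int iter_len[M] = {5, 3, 4};

int main(void)
{
    int blk = (G + M - 1) / M;                      /* ceil(g/m) */
    for (int k = 0; k < M; k++) {                   /* sending processor   */
        int lo = blk * k + 1, hi = blk * (k + 1);   /* src_iter(Pk)        */
        for (int kp = 0; kp < M; kp++) {            /* receiving processor */
            if (kp == k) continue;
            printf("P%d -> P%d : {", k, kp);
            for (int j = 0; j < iter_len[kp]; j++) {
                int i = iter[kp][j];
                if (i >= lo && i <= hi)             /* i in src_iter(Pk) ∩ iter(Pkp) */
                    printf(" %d", i);
            }
            printf(" }\n");
        }
    }
    return 0;
}
```

Running this reproduces the six redistribution sets listed in Figure 2.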


[Figure 2. Subscript computations of index array redistribution in Example 1. The figure lists the redistribution sets src_iter(P1) ∩ iter(P0) = {5, 8}, src_iter(P2) ∩ iter(P0) = {9, 10}, src_iter(P0) ∩ iter(P1) = {2, 3, 4}, src_iter(P2) ∩ iter(P1) = {}, src_iter(P0) ∩ iter(P2) = {}, and src_iter(P1) ∩ iter(P2) = {6, 7}, together with the redistributed values of the index arrays ix, iy, and iz on P0, P1, and P2.]

3.2 Scheduling in Redistribution Procedure

A redistribution routine can be divided into two parts: subscript computation and interprocessor communication. If there is no communication scheduling in a redistribution routine, communication contention may occur, which increases the communication waiting time: in a given communication step, several processors may send messages to the same destination processor, which leads to node contention, and node contention results in overall performance degradation [5, 6]. Scheduling can avoid this contention.

Another consideration is to arrange the messages within a step so that their sizes are as close to each other as possible, since the cost of a step is likely to be dictated by the length of the longest message exchanged during that step. We use a sparse matrix M to represent the communication pattern between the source and target processor sets, where M(i, j) gives the size of the message that sending processor P_i must send to receiving processor Q_j. The following matrices describe a redistribution pattern and a corresponding schedule:

    M  = [  0   3   0  10   3 ]      CS = [ 1  4  3    ]
         [ 10   0   2   1   2 ]           [ 4  2  0  3 ]
         [  2   2   0   3   9 ]           [ 0  3  4  1 ]
         [ 10   0   2   0   0 ]           [ 0  2       ]
         [  0   4   3   3   0 ]           [ 3  1  2    ]

We have proposed an algorithm that accepts M as input and generates a communication scheduling table CS, where CS(i, k) = j means that sending processor P_i sends its message to receiving processor Q_j at communication step k. The schedule satisfies two conditions:

• there is no node contention in any communication step; and
• the sum over all steps of the longest message length in each step is minimal.
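The scheduling algorithm itself is not reproduced in this excerpt. The sketch below is only a simple greedy illustration of the contention-free constraint, namely that within one step each sender issues at most one message and each receiver is the target of at most one message; it does not implement the paper's optimality condition on the sum of longest message lengths. The message matrix is the example above; everything else is an assumption of this sketch.

```c
/* Sketch: a greedy, contention-free schedule for the message matrix M.
 * This is NOT the algorithm proposed in the paper; it only enforces the
 * constraint that within one step each sender sends at most one message
 * and each receiver is targeted by at most one sender.                   */
#include <stdio.h>

#define NP 5   /* number of senders / receivers in the example */

int main(void)
{
    int M[NP][NP] = {
        { 0,  3,  0, 10,  3},
        {10,  0,  2,  1,  2},
        { 2,  2,  0,  3,  9},
        {10,  0,  2,  0,  0},
        { 0,  4,  3,  3,  0},
    };
    int remaining = 0;
    for (int i = 0; i < NP; i++)
        for (int j = 0; j < NP; j++)
            if (M[i][j] > 0) remaining++;

    for (int step = 1; remaining > 0; step++) {
        int send_busy[NP] = {0}, recv_busy[NP] = {0};
        printf("step %d:", step);
        /* repeatedly pick the largest remaining message whose sender and
         * receiver are both still free in this step                       */
        for (;;) {
            int bi = -1, bj = -1, best = 0;
            for (int i = 0; i < NP; i++)
                for (int j = 0; j < NP; j++)
                    if (!send_busy[i] && !recv_busy[j] && M[i][j] > best) {
                        best = M[i][j]; bi = i; bj = j;
                    }
            if (bi < 0) break;              /* no more pairs fit this step */
            printf("  P%d->Q%d(%d)", bi, bj, best);
            send_busy[bi] = recv_busy[bj] = 1;
            M[bi][bj] = 0;
            remaining--;
        }
        printf("\n");
    }
    return 0;
}
```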
4 Inter-procedural Communication Optimization for Irregular Loops

In some irregular scientific codes, an important optimization is communication preprocessing across procedure calls. In this section, we extend a classical data-flow optimization technique, Partial Redundancy Elimination, to Interprocedural Partial Redundancy Elimination as a basis for performing interprocedural communication optimization [2]. Partial Redundancy Elimination encompasses traditional optimizations such as loop-invariant code motion and redundant computation elimination.

For irregular code with procedure calls, the initial intraprocedural analysis (see [8]) inserts a pre-communication call (comprising a buffering and a gathering routine) and a post-communication call (buffering and scattering routines) for each of the two data-parallel loops in the two subroutines. After interprocedural analysis, loop invariants can be hoisted outside the loop.

Here, the data arrays and index arrays are the same in the loop bodies of the two subroutines. Although a given communication statement may not be entirely redundant, another communication statement may gather at least a subset of the values it gathers. In some situations, the same data array A is accessed through an indirection array IA in one subroutine SUB1 and through another indirection array IB in another subroutine SUB2, and neither the indirection arrays nor the data array A is modified between the first loop and the second loop; then there is at least some overlap between the communication required by the two loops. Another case is that the data array and indirection array are the same but the loop bounds differ, in which case the first loop can be viewed as a part of the second loop.

We distinguish two kinds of communication routines for such situations. A common communication routine takes more than one indirection array, or the common part of two indirection arrays; it takes in arrays IA and IB and produces a single buffering. An incremental preprocessing routine takes in the indirection arrays IA and IB and determines the off-processor references made through IB but not through IA. While executing the second loop, communication can then be performed using an incremental schedule, gathering only the data elements that were not gathered during the first loop.
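A minimal sketch of the incremental preprocessing idea is shown below: it marks the off-processor references reached through IA in the first loop and then reports only those references reached through IB that were not already gathered. The BLOCK ownership function, the array contents, and all names are assumptions of this sketch, not the actual runtime routines.

```c
/* Sketch: compute an "incremental" set of off-processor references, i.e.
 * global indices reached through indirection array IB that were not already
 * gathered through IA in the first loop.  The ownership test, array values,
 * and names are illustrative assumptions, not the paper's runtime routines. */
#include <stdio.h>

#define NGLOBAL 12          /* global size of data array A (assumption) */
#define BLK      4          /* block size per processor (assumption)    */

static int owner_of(int g) { return (g - 1) / BLK; }   /* BLOCK ownership */

/* Mark the off-processor references made through idx[] on processor `me`. */
static void mark_offproc(const int idx[], int n, int me, int mark[])
{
    for (int i = 0; i < n; i++)
        if (owner_of(idx[i]) != me)
            mark[idx[i]] = 1;
}

int main(void)
{
    int me = 0;                                /* local processor id        */
    int IA[] = {5, 1, 4, 6, 8};                /* references of first loop  */
    int IB[] = {5, 2, 9, 4, 6};                /* references of second loop */
    int gathered[NGLOBAL + 1] = {0};           /* fetched during first loop */
    int needed[NGLOBAL + 1] = {0};             /* needed by second loop     */

    mark_offproc(IA, 5, me, gathered);
    mark_offproc(IB, 5, me, needed);

    printf("incremental gathers for the second loop:");
    for (int g = 1; g <= NGLOBAL; g++)
        if (needed[g] && !gathered[g])         /* through IB but not IA */
            printf(" %d", g);
    printf("\n");
    return 0;
}
```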


5 OpenMP Extensions for Irregular Applications

OpenMP's programming model is based on fork-join parallelism: a master thread spawns a team of threads as needed. Parallelism is added incrementally, that is, the sequential program evolves into a parallel program, so the whole program does not have to be parallelized at once. OpenMP is usually used to parallelize loops: a user finds the most time-consuming loops in the code and splits them up between threads.

OpenMP is a model for programming any parallel architecture that provides the abstraction of a shared address space to the programmer. The purpose of OpenMP is to ease the development of portable parallel code by enabling the incremental construction of parallel programs and by hiding the details of the underlying hardware/software interface from the programmer. A parallel OpenMP program can be obtained directly from its sequential counterpart by adding parallelization directives.

If the loop shown in Fig. 1 is split across OpenMP threads then, although the threads always have distinct values of j, prev_elem and next_elem may simultaneously take the same values on different threads. As a result, there is a potential problem with updating the value of new_cell. There are simple solutions to this problem, such as making all the updates atomic or having each thread compute temporary results that are then combined across threads. However, for the extremely common case of sparse array access, neither of these approaches is particularly efficient.
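Fig. 1 is not reproduced in this excerpt; the sketch below assumes a representative indirect-update loop (only the names new_cell, prev_elem, and next_elem come from the text) and shows the conventional atomic-update workaround that, as noted above, is inefficient for sparse array access.

```c
/* Sketch: the conventional OpenMP workaround for indirect updates, using
 * atomic operations.  The loop body is an assumed stand-in for the code of
 * Fig. 1 (not reproduced here); only the names new_cell, prev_elem and
 * next_elem come from the text.                                            */
#include <stdio.h>
#include <omp.h>

#define N     1000
#define CELLS 100

double new_cell[CELLS];
int    prev_elem[N], next_elem[N];
double contrib[N];

int main(void)
{
    /* some arbitrary indirection pattern and contributions (assumption) */
    for (int j = 0; j < N; j++) {
        prev_elem[j] = j % CELLS;
        next_elem[j] = (j * 7) % CELLS;
        contrib[j]   = 1.0;
    }

    #pragma omp parallel for
    for (int j = 0; j < N; j++) {
        /* different threads may hit the same cell through the indirection
         * arrays, so each indirect update must be protected               */
        #pragma omp atomic
        new_cell[prev_elem[j]] += contrib[j];
        #pragma omp atomic
        new_cell[next_elem[j]] -= contrib[j];
    }

    printf("new_cell[0] = %f\n", new_cell[0]);
    return 0;
}
```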
5.1 Directive Extension for Irregular Loops

This section presents our extension to OpenMP, the irregular directive. The irregular directive can be applied to the parallel do directive in one of the following situations:

• When the parallel region is recognized as an irregular loop: in this case the compiler invokes a runtime library that partitions the irregular loop according to a special computes rule.

• When an ordered clause is recognized in a parallel region whose loop is irregular: in this case the compiler treats the loop as partially ordered; that is, some iterations are executed sequentially while others may be executed in parallel.

• When a reduction clause is recognized in a parallel region whose loop is irregular: in this case the compiler invokes inspector/executor routines to perform the irregular reduction in parallel.

The irregular directives in the extended OpenMP version may have the following patterns:

$!omp parallel do schedule(irregular, [irarray1, ..., irarrayN])

This pattern indicates that the compiler will encounter an irregular loop, where irarray1, ..., irarrayN are possible indirection arrays; or

$!omp parallel do reduction|ordered irregular([expr1, ..., exprN])

This pattern indicates that the compiler will encounter a special irregular reduction or irregular ordered loop, where expr1, ..., exprN are expressions such as loop index variables.

5.2 Partially Ordered Loops

A special case of the example in Fig. 1 occurs when the shared updates must be performed not only in mutual exclusion but also in an ordered way. The use of the irregular clause in this case tells the compiler that iterations which may update the same data on different threads must be executed in an ordered way, while the other iterations are still executed in parallel. An example of code using the irregular clause in this manner is the following:

Example 1

$!omp parallel do ordered irregular(x, y)
do i = 1, n
   x[i] = indirect(1, i)
   y[i] = indirect(2, i)
$!omp ordered
   a(x[i]) = a(y[i]) ...
$!omp end ordered
end do
$!omp end parallel do

6 Conclusions

The efficiency of loop partitioning considerably influences the performance of a parallel program, and for automatically parallelizing irregular scientific codes the owner computes rule is not a suitable basis for partitioning irregular loops. In this paper, we have presented an efficient loop partitioning approach that reduces communication cost for a class of irregular loops with nonlinear array subscripts. In our approach, runtime preprocessing is used to determine the communication required between the processors, and we have developed algorithms that perform these communication optimizations. We have also shown how interprocedural communication optimization can be achieved. Furthermore, for irregular codes parallelized with OpenMP, we have proposed extensions to the OpenMP directives so that such codes can be compiled and executed efficiently. We have completed a preliminary implementation of the schemes presented in this paper, and the experimental results demonstrate the efficacy of our schemes.


References

[1] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers, 2001.

[2] G. Agrawal and J. Saltz. Interprocedural compilation of irregular applications for distributed memory machines. In Languages and Compilers for Parallel Computing, pp. 1-16, August 1994.

[3] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. Journal of Parallel and Distributed Computing, 22(3):462-479, September 1994.

[4] C. Ding and K. Kennedy. Improving cache performance of dynamic applications with computation and data layout transformations. In Proceedings of the SIGPLAN'99 Conference on Programming Language Design and Implementation, Atlanta, GA, May 1999.

[5] M. Guo, I. Nakata, and Y. Yamashita. Contention-free communication scheduling for array redistribution. Parallel Computing, 26:1325-1343, 2000.

[6] M. Guo and I. Nakata. A framework for efficient array redistribution on distributed memory machines. The Journal of Supercomputing, 20(3):253-265, 2001.

[7] M. Guo, Y. Pan, and C. Liu. Symbolic communication set generation for irregular parallel applications. To appear in The Journal of Supercomputing, 2002.

[8] M. Guo, Z. Liu, C. Liu, and L. Li. Reducing communication cost for parallelizing irregular scientific codes. In Proceedings of the 6th International Conference on Applied Parallel Computing, Finland, June 2002.

[9] E. Gutierrez, R. Asenjo, O. Plata, and E.L. Zapata. Automatic parallelization of irregular applications. Parallel Computing, 26:1709-1738, 2000.

[10] Y.-S. Hwang, B. Moon, S.D. Sharma, R. Ponnusamy, R. Das, and J. Saltz. Runtime and language support for compiling adaptive irregular programs on distributed memory machines. Software: Practice and Experience, 25(6):597-621, 1995.

[11] J.M. Stone and M. Norman. ZEUS-2D: A radiation magnetohydrodynamics code for astrophysical flows in two space dimensions: The hydrodynamic algorithms and tests. Astrophysical Journal Supplement Series, 80:753-790, 1992.

[12] M. Berry, D. Chen, P. Koss, D. Kuck, S. Lo, Y. Pang, R. Roloff, A. Sameh, E. Clementi, S. Chin, D. Schneider, G. Fox, P. Messina, D. Walker, C. Hsiung, J. Schwarzmeier, K. Lue, S. Orzag, F. Seidl, O. Johnson, G. Swanson, R. Goodrum, and J. Martin. The PERFECT Club benchmarks: effective performance evaluation of supercomputers. International Journal of Supercomputing Applications, 3(3):5-40, 1989.

[13] R. Ponnusamy, Y.-S. Hwang, R. Das, J. Saltz, A. Choudhary, and G. Fox. Supporting irregular distributions in Fortran D/HPF compilers. Technical Report CS-TR-3268, Department of Computer Science, University of Maryland, 1994.

[14] R. Ponnusamy, J. Saltz, A. Choudhary, S. Hwang, and G. Fox. Runtime support and compilation methods for user-specified data distributions. IEEE Transactions on Parallel and Distributed Systems, 6(8):815-831, 1995.

[15] M. Ujaldon, E.L. Zapata, B.M. Chapman, and H.P. Zima. Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation. IEEE Transactions on Parallel and Distributed Systems, 8(10):1068-1083, October 1997.
