The Power of Belady's Algorithm in Register Allocation for Long ...

quently. In the next section, we explain the compiler we designed to perform the register allocation on the FFT and MM codes.

3 A Compiler for Long Straight-line Code

Belady's MIN [2] is a replacement algorithm for pages in virtual memory. On a replacement, the page to replace is the one with the farthest next use. The MIN algorithm is optimal because it generates the minimal number of physical memory block replacements. However, in the context of virtual memory the MIN algorithm is impractical in most applications, because it is usually not known which memory block will be referenced in the future.

Belady's MIN algorithm has been proposed for register allocation in long basic blocks, where the compiler knows exactly which values will be used in the future. In this context, the MIN algorithm is also known as Farthest First (FF) [6] since, on a register replacement, the register to replace first is the one holding the value with the farthest next use.

The MIN algorithm is not optimal for register allocation since the replacement decision is based only on the distance and not on whether the register has been modified. When a register holds a value that is not consistent with the value in memory, we say that the register is dirty. Otherwise, we say that the register is clean. If the register to be replaced is dirty, its value needs to be stored back to memory before a new value can be loaded into it. Thus, for a given instruction schedule the MIN algorithm guarantees the minimum number of register replacements, but it does not guarantee the minimum traffic with memory, that is, the minimum number of loads and stores. In our implementation, when there are several candidates for replacement with the same distance, our compiler chooses one in the clean state to avoid an extra store.

In order to further reduce the number of stores, another simple heuristic called Clean First (CF) was proposed in [6].
With this heuristic, when a live register needs to be replaced, CF first searches the clean registers. The clean register that contains the value with the farthest next use is chosen. If there is no clean register, the most distant dirty one is chosen.

We implemented a back-end compiler that uses the MIN and the CF algorithms for register allocation. Next, we describe the implementation details of our compiler.

3.1 Implementation Details

We built a simple compiler that translates high-level code into assembly code. Our compiler assumes that all the optimizations, including instruction scheduling, have been applied before the high-level code is generated. It only performs register allocation using Belady's MIN or the CF heuristic explained above.

The compiler has two steps. In the first step we transform the long straight-line code into static single-assignment (SSA) form and build the definition-use chain for all the variables. In the second step we do register allocation as shown in Figure 2. The second step closely follows the simple algorithm described in [1].

The algorithm can be implemented more efficiently. The register file can be a priority queue implemented with a binary heap, where higher priority is given to the farther next use. Operations such as extracting the register with the farthest next use can be executed in $O(\log R)$, where $R$ is the number of registers. So the time complexity of the MIN algorithm is $s \times O(\log R)$, where $s$ is the number of references to the variables in the program.

DATA STRUCTURES:
  register file: Array of registers. Each register r has 3 fields:
    - state: Indicates whether the register is clean.
    - var: The current variable in the register.
    - addr: The address of the variable in the register.
  next use list: List used to keep the stmt where each variable is used/defined. Each node in the list has 2 fields:
    - stmt: The statement number.
    - status: Definition or use.

FUNCTIONS:
  NEXT_USE (reg, stmt): Returns the statement number where reg.var is used next after stmt.
  MIN_REG_ALLOC (var, stmt): Returns the register that is going to be used for the variable var at stmt.

MAIN
  for each stmt in the program do begin
    for each v on RHS do begin
      reg ← MIN_REG_ALLOC (v, stmt)
      generate instruction "load reg, reg.addr"
    end
    for v on LHS do begin
      reg ← MIN_REG_ALLOC (v, stmt)
    end
    generate assembly instruction "op r1 r2 r3"
  end

MIN_REG_ALLOC (v, stmt)
  if v is in reg
    return reg
  farthestUse ← 0
  regToUse ← 0
  for each reg in register file do begin
    if reg is empty
      return reg
    if variable in reg is dead
      return reg
    nextUse ← NEXT_USE (reg, stmt)
    if farthestUse < nextUse
      farthestUse ← nextUse
      regToUse ← reg
    else if farthestUse = nextUse and reg.state = CLEAN
      regToUse ← reg
  end
  if regToUse.state = DIRTY
    generate register spill "store regToUse, regToUse.addr"
  return regToUse

Figure 2. Pseudocode for Belady's MIN algorithm.
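The replacement policy described above can be sketched in a few lines of Python. This is a minimal illustration, not the compiler described in the paper: the names `Reg`, `build_uses`, `next_use`, and `allocate` are hypothetical, statements are modeled as (defined variable, variables read) pairs, the register file is scanned linearly rather than kept in a heap, and the stores of final results are not modeled. It assumes at least three registers and records only spills (stores of live dirty values) and reloads. The `policy` argument switches between MIN with the clean tie-break and the CF heuristic.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

INF = float("inf")
Stmt = Tuple[str, List[str]]  # (variable defined, variables read)


@dataclass
class Reg:
    var: Optional[str] = None  # variable currently held; None means empty
    dirty: bool = False        # True when the value is not yet in memory


def build_uses(prog: List[Stmt]) -> Dict[str, List[int]]:
    """Map each variable to the (1-based) statements that read it."""
    uses: Dict[str, List[int]] = {}
    for s, (_, srcs) in enumerate(prog, start=1):
        for v in srcs:
            uses.setdefault(v, []).append(s)
    return uses


def next_use(uses: Dict[str, List[int]], var: str, stmt: int) -> float:
    """First statement after stmt that reads var, or INF if var is dead."""
    reads = uses.get(var, [])
    k = bisect_right(reads, stmt)
    return reads[k] if k < len(reads) else INF


def allocate(prog: List[Stmt], num_regs: int, policy: str = "MIN"):
    """Return (spills, reloads) as lists of (statement, variable) events."""
    uses = build_uses(prog)
    regs = [Reg() for _ in range(num_regs)]
    seen: Set[str] = set()
    spills: List[Tuple[int, str]] = []
    reloads: List[Tuple[int, str]] = []

    def find(var: str) -> Optional[int]:
        return next((i for i, r in enumerate(regs) if r.var == var), None)

    def victim(stmt: int, pinned: Set[int]) -> int:
        best, best_key = -1, None
        for i, r in enumerate(regs):
            if i in pinned:
                continue
            dist = INF if r.var is None else next_use(uses, r.var, stmt)
            if dist == INF:
                return i  # empty register or dead value: reuse for free
            clean = not r.dirty
            # MIN: farthest next use, clean breaks ties. CF: clean first.
            key = (dist, clean) if policy == "MIN" else (clean, dist)
            if best_key is None or key > best_key:
                best, best_key = i, key
        return best

    def place(stmt: int, pinned: Set[int], var: str, dirty: bool) -> int:
        i = victim(stmt, pinned)
        old = regs[i]
        if old.var is not None and old.dirty and next_use(uses, old.var, stmt) != INF:
            spills.append((stmt, old.var))  # live dirty value: store it first
        regs[i] = Reg(var, dirty)
        return i

    for s, (dest, srcs) in enumerate(prog, start=1):
        # Operands already in registers must survive until the operation.
        pinned = {i for i in map(find, srcs) if i is not None}
        for v in srcs:
            if find(v) is None:
                pinned.add(place(s, pinned, v, dirty=False))  # load from memory
                if v in seen:
                    reloads.append((s, v))
                seen.add(v)
        place(s, set(), dest, dirty=True)  # result may reuse an operand register
        seen.add(dest)
    return spills, reloads


# The size-4 FFT of Figure 3-(a), with x(i) written as xi and y(i) as yi.
FFT4: List[Stmt] = [
    ("t0", ["x1", "x5"]), ("t1", ["x2", "x6"]), ("t2", ["x1", "x5"]), ("t3", ["x2", "x6"]),
    ("t4", ["x3", "x7"]), ("t5", ["x4", "x8"]), ("t6", ["x3", "x7"]), ("t7", ["x4", "x8"]),
    ("y5", ["t2", "t6"]), ("y6", ["t3", "t7"]), ("y1", ["t2", "t6"]), ("y2", ["t3", "t7"]),
    ("y7", ["t0", "t5"]), ("y8", ["t1", "t4"]), ("y3", ["t0", "t5"]), ("y4", ["t1", "t4"]),
]

spills, reloads = allocate(FFT4, num_regs=6)
print(spills[0])  # (5, 't1')
```

With 6 registers the sketch reports its first spill at statement 5, evicting t1, which is the situation worked through in the example that follows. A heap keyed on next use, as suggested in the text, would replace the linear scan in `victim`.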

 1  t0 = x(1) - x(5)
 2  t1 = x(2) - x(6)
 3  t2 = x(1) + x(5)
 4  t3 = x(2) + x(6)
 5  t4 = x(3) - x(7)      (first spill)
 6  t5 = x(4) - x(8)
 7  t6 = x(3) + x(7)
 8  t7 = x(4) + x(8)
 9  y(5) = t2 - t6
10  y(6) = t3 - t7
11  y(1) = t2 + t6
12  y(2) = t3 + t7
13  y(7) = t0 - t5
14  y(8) = t1 + t4
15  y(3) = t0 + t5
16  y(4) = t1 - t4
(a)

    load  $f0, 0($5)
    load  $f1, 32($5)
    sub   $f2, $f0, $f1
    load  $f3, 8($5)
    load  $f4, 40($5)
    sub   $f5, $f3, $f4
    add   $f0, $f0, $f1
    add   $f1, $f3, $f4
    load  $f3, 16($5)
    load  $f4, 48($5)
    store $f5, 24($sp)    (first spill)
    sub   $f5, $f3, $f4
    store $f5, 48($sp)
    load  $f5, 24($5)
    store $f2, 16($sp)
    load  $f2, 56($5)
    ……
(b)

Figure 3. An example of register allocation using Belady's MIN algorithm. (a) Source code for FFT of size 4. (b) The resulting assembly code using Belady's MIN algorithm.

Figure 3 gives an example of how our compiler works using the MIN algorithm. The code in Figure 3-(a) corresponds to the FFT of size 4. It has been generated using SPIRAL [20, 23]. Suppose the number of floating-point (FP) registers is 6. At statement 5, a register needs to be replaced for the first time. Table 1 gives a snapshot of the data structures at statement 5. The variable t1 has the farthest next use. As a result, register $f5 is replaced. Since $f5 is dirty, there is a register spill. Figure 3-(b) shows the assembly code after doing register allocation for the code in Figure 3-(a).

(a)
Reg   Current variable   State   Addr
$f0   x(1) → t2          dirty   40($sp)
$f1   x(5) → t3          dirty   32($sp)
$f2   t0                 dirty   16($sp)
$f3   x(2) → x(3)        clean   16($5)
$f4   x(6) → x(7)        clean   48($5)
$f5   t1                 dirty   24($sp)

(b)
Variable   Du-chain              Next use
t2         (3,d)(9,u)(11,u)      9
t3         (4,d)(10,u)(12,u)     10
t0         (1,d)(13,u)(15,u)     13
x(3)       (5,u)(7,u)            7
x(7)       (5,u)(7,u)            7
t1         (2,d)(14,u)(16,u)     14

Table 1. Important data structures used in the MIN algorithm. (a) The register file. (b) The du-chain and next use at statement 5.

Notice that our compiler follows the instruction scheduling embedded in the source code; that is, our compiler schedules the arithmetic operations in the same order as they appear in the source code.
Also, while most compilers do aggressive scheduling for load operations, ours does not. Compilers try to hoist loads a significant distance above the point where the register is used so that cache latency can be hidden on a miss. However, our compiler loads values into the registers immediately before they are used. Although load hoisting can increase register pressure, some loads could be moved ahead of their use without increasing register pressure. This would simply require an additional pass in our compiler. However, we have not implemented that.

We placed each long basic block in a procedure in a separate file, and then we call our compiler to do the register allocation and generate the assembly code. When the generated code uses registers that contain values from outside of the procedure, we save them at the beginning of the procedure, and restore them at the end.

4 Evaluation

4.1 Environmental Setup

In this section, we compare our compiler against the GCC and MIPSPro compilers and evaluate how well they behave on long straight-line codes. For the evaluation, we have used the already optimized unrolled codes obtained from SPIRAL and ATLAS. SPIRAL produces Fortran code, while ATLAS produces C code. Table 2 shows the version and flags that we used for the MIPSPro and GCC compilers. Our compiler is the one described in Section 3.1 that implements the MIN and CF algorithms. Remember that our compiler schedules operations in the same order as they appear in the source code generated by ATLAS and SPIRAL. Both ATLAS and SPIRAL perform some kind of instruction scheduling in the source code. MIPSPro and GCC, however, rearrange the SPIRAL and ATLAS code. As a result, their instruction scheduling is different from ours.

Appl.                   Compiler   Version     Flags
SPIRAL (Fortran code)   MIPSPro    7.3.2.1.m   -OPT:Olimit=0 -O3
                        G77        3.2         -O3
ATLAS (C code)          MIPSPro    7.3.1.1m    -O3 -64 -OPT:Olimit=15000 -TARG:platform=IP27 -LNO:blocking=OFF -LOPT:alias=typed
                        GCC        3.2         -fomit-frame-pointer -O3

Table 2. Compiler version and flags for MIPSPro and GCC.

All the experiments were done on a MIPS R12000 processor with a frequency of 270 MHz. The machine has 32 floating-point registers, an L1 instruction cache of 32 KB, and an L1 data cache of 32 KB. In all the experiments that use MIPSPro and our compiler, the code fits into the L1 instruction cache, just as the data fit into the L1 data cache. However, in a few cases where the code was compiled with GCC, it did not fit into the instruction cache (we point this out in the evaluation in the next section). Finally, notice that integer registers are not a problem in the FFT or MM codes since the machine has 32 registers and we only need a few of them.

Next, we study the effectiveness of our compiler on the long straight-line code of FFT (Section 4.2) and MM (Section 4.3). Finally, in Section 4.4, we summarize our results.

4.2 FFT

The FFT code that we use is the code generated by the SPIRAL compiler. In this section, we first study the characteristics of the FFT code generated by the SPIRAL compiler (Section 4.2.1) and then we evaluate the performance (Section 4.2.2).

4.2.1 SPIRAL and FFT Code

SPIRAL translates formulas representing signal processing transforms into efficient Fortran programs. It uses intelligent search strategies to automatically generate optimized DSP libraries. In the case of FFT, SPIRAL first searches for a good implementation for small-size transforms, 2 to 64, and then searches for a good implementation for larger-size transforms that use the small-size results. For FFT sizes smaller than 64, SPIRAL found that straight-line code achieves the best performance since loop control overheads are eliminated and all the temporary variables can be scalars.

To better understand the performance results, we first study the patterns that appear in the FFT code. Some of these patterns are due to the way SPIRAL generates code, while others are intrinsic to the nature of the FFT. Patterns that come from SPIRAL are: 1) Each variable is defined only once; that is, every variable holds only one value during its lifetime. For that, SPIRAL uses renaming. 2) If a variable has two uses, at most one statement is between the two uses of the variable. This means that the two uses of a variable are very close to each other. On the other hand, patterns that are intrinsic to FFT are: 3) Each variable is used at most twice.
4) If two variables appear in a pair on the RHS of an expression, then they always appear in a pair, and they appear twice. Figure 3-(a) shows an example of FFT code of size 4, where these patterns are shown. In that example, array x is the input, array y is the output, and the t's are temporary variables. As can be seen, each element of the input array x has two uses. The uses of x() or t variables always appear in pairs, and there is only one statement between these two uses.

This FFT code generated by SPIRAL is used as the input for our compiler. Given the proximity of the two uses of each variable in the SPIRAL code, any compiler would minimize register replacements by keeping the variable in the same register during the two uses. As a result, the two uses of a variable can be considered as a single use. Thus, the problem of register allocation for the FFT code generated by SPIRAL can be simplified to the problem of register allocation where each variable is defined once and used once.

Figure 4. Performance of the best formula for FFTs 4 - 64. (Plot not reproduced; y-axis: pseudo MFlops, 5N log2 N / time in microseconds; x-axis: FFT size, 4 to 64; series: MIN, MIPSPro, G77.)

Based on this simplified model, register replacement only occurs between the definition and the use of the variable. One consequence is that the MIN and CF algorithms behave similarly and they always choose the same register to replace. In addition, since the MIN algorithm implemented in our compiler is known to produce the minimum number of register replacements, we can claim that for the FFT problem and given the SPIRAL scheduling, our compiler generates the optimal solution, the one with the minimum number of loads and stores.
In the next section we evaluate the performance differences between this optimal solution and the MIPSPro or G77 compilers.

4.2.2 Performance Evaluation

Size   Backend   LOC    Spills   Reloads
4      MIPSPro   34     0        0
       MIN       34     0        0
8      MIPSPro   90     0        0
       MIN       95     0        0
16     MIPSPro   266    9        14
       MIN       276    2        2
32     MIPSPro   921    150      212
       MIN       764    34       34
64     MIPSPro   2468   552      606
       MIN       1944   112      112

Table 3. Characteristics of the code that MIPSPro and MIN generate for FFTs 4-64.

SPIRAL does an exhaustive search to find the fastest formula for a given platform. We studied the performance obtained by the best formula when the code was compiled using the MIPSPro compiler, G77, or our compiler. Figure 4 shows the best performance obtained for FFTs of size 4 - 64 using the MIPSPro compiler (MIPSPro), the G77 compiler (G77), or our compiler (MIN). The performance is measured in terms of "pseudo MFlops", which is the value computed by using the equation $5N\log_2 N / t$, where $N$ is the size of the FFT and $t$ is the execution time in microseconds. Notice that the formula that achieves the best performance can be different in each case.

We focus on MIPSPro and MIN, since G77 is always much slower. To help understand the results, Table 3 shows the number of lines of the assembly code (LOC), spills, and reloads for each point in Figure 4. A spill is a store of a value that needs to be loaded again later. A reload is a load of a value that previously was in a register. The data in Table 3 show that, using MIN, the number of spills and reloads is always the same. This is due to the SPIRAL scheduling. As Table 3 shows, for FFTs of size 4 and 8, the 32 FP registers are enough to hold the values in the program, and as a result there is no register replacement. Thus, the difference in performance between MIPSPro and MIN comes from the differences in instruction scheduling. From FFTs of size 16, we start to see some spills and reloads, and MIN overcomes the effects of instruction scheduling and obtains the same performance as MIPSPro. Finally, for FFTs of size 32 and 64, since the amount of spilling is larger, the effect of instruction scheduling becomes less important, and MIN outperforms MIPSPro. MIN performs 12% and 33% better than MIPSPro for FFTs of size 32 and 64, respectively.

Figure 5. Performance of the different formulas for FFTs of size 64. (Plot not reproduced; y-axis: execution time in microseconds; x-axis: different formulas for FFT of size 64; series: MIPSPro, MIN.)

In Figure 5, we show the execution time of several FFT codes of size 64 that SPIRAL produced using different formulas. For each formula we show two points. One corresponds to the performance obtained when the SPIRAL code for that formula was compiled using the MIPSPro compiler (MIPSPro), and the other to the performance obtained using our compiler (MIN). On average, MIN runs 18% faster than MIPSPro. In addition, the figure shows that our compiler always performs better.
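As a small worked example of the pseudo MFlops metric used in Figure 4, the function below evaluates 5N log2 N / t for a given FFT size and execution time. The function name and the timing value are hypothetical and chosen only to keep the arithmetic easy to follow; they are not measurements from the paper.

```python
import math


def pseudo_mflops(n: int, time_us: float) -> float:
    """Pseudo MFlops as defined in the text: 5 * N * log2(N) / t, with t in microseconds."""
    return 5 * n * math.log2(n) / time_us


# Hypothetical timing: an FFT of size 64 that runs in 4 microseconds.
print(pseudo_mflops(64, 4.0))  # 5 * 64 * 6 / 4 = 480.0
```

Because the operation count is fixed by N, two compilers compared at the same FFT size differ in pseudo MFlops exactly by the inverse ratio of their execution times.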
As before, it seems to us that as register pressure increases, register allocation becomes the dominant factor.

4.3 Matrix Multiplication

In this section, we study the performance of register allocation for the matrix multiplication code produced by ATLAS. We first describe ATLAS (Section 4.3.1) and then present the performance evaluation (Section 4.3.2).

4.3.1 Overview of ATLAS

ATLAS is an empirical optimizer whose structure is shown in Figure 6. ATLAS is composed of i) a Search Engine that performs an empirical search over the values of certain optimization parameters and ii) a Code Generator that generates C code given these values. The generated C code is compiled and executed, and its performance is measured. The system keeps track of the values that produced the best performance, which will be used to generate a highly tuned library.

Figure 6. Empirical optimizer in ATLAS. (Diagram not reproduced; labels: ATLAS Search Engine; Tile Size, Unroll values, Latency, Muladd, Fetch; MFLOPS; ATLAS MM Code Generator; MiniMMM C code; Compiler; Execute & Measure.)

For the search process, ATLAS generates a matrix multiplication of size Tile Size that we call MiniMMM. The code for the MiniMMM is itself tiled to make better use of the registers. Each of these small matrix multiplications multiplies an MUx1 sub-matrix of A with a 1xNU sub-matrix of B, and accumulates the result in an MUxNU sub-matrix of C. We call these micro-MMMs. Figure 7 shows a pictorial view and the pseudo-code corresponding to the mini-MMM after register tiling and unrolling of the micro-MMM. The code of a micro-MMM is unrolled to produce straight-line code. After the register tiling, the K loop in Figure 7 is unrolled by a factor KU. The result is straight-line code that contains KU copies of the micro-MMM code. Figure 9-(a) shows two copies of the micro-MMM that corresponds to the unrolls MU=4 and NU=2 shown in Figure 7. Notice that the degree of unroll MUxNU determines the number of FP registers required. This number is MUxNU to hold the C values, MU to hold the A values, and NU to hold the B values.

(Figure 7 is not reproduced; its diagram labels are MU, NU, K, Tile Size, A, B, and C, and its pseudo-code is cut off after "for (int j =0; j".)
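The register requirement stated above can be written out as a one-line count. The sketch below only evaluates that count for a few of the unrolls that appear in the evaluation; the function name is hypothetical, and where replacement actually begins in practice depends on the allocator and the schedule rather than on this count alone.

```python
def fp_registers_needed(mu: int, nu: int) -> int:
    """MU*NU registers for the C values, MU for the A values, NU for the B values."""
    return mu * nu + mu + nu


for mu, nu in [(2, 4), (4, 4), (4, 6), (4, 8)]:
    print(f"{mu}x{nu}: {fp_registers_needed(mu, nu)} FP registers")
```

For the machine used here, with 32 FP registers, the count is 24 for a 4x4 unroll and 34 for a 4x6 unroll.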

For this experiment the rest of the parameters of the miniMMM have been set to the values that ATLAS found to be optimal [24]. In particular, TileSize and KU have been set to 64 (ATLAS only tries square tiles); that is, the innermost k loop in Figure 7 is totally unrolled. Notice that for the order that ATLAS uses for register blocking, while unrolling along the k dimension does reduce loop overheads, it does not increase register pressure.

Figure 8. Performance versus MUxNU unroll for the mini-MMM. (Plot not reproduced; y-axis: MFlops; x-axis: degree of unroll, from 2x2 to 8x8; series: MIPSPro, MIN, MINSched, GCC.)

Figure 8 shows that, as before, GCC has the worst performance. In particular, the sharp drop for unrolls 6x8 and larger is due to the size of the code, which overflows the 32 KB instruction cache of the MIPS R12000 processor. We now focus on MIPSPro. Figure 8 also shows that MIN behaves almost like MIPSPro when MU and NU are small and, as a result, there is no register replacement. Thus, the slightly better performance of MIN is mostly due to differences in instruction scheduling. Register replacement occurs for unrolls 4x4 and larger in MIPSPro, and unrolls 4x5 and larger in MIN. Unfortunately, for unrolls 4x6, where there are more values than registers and register replacement occurs, MIN performs worse than MIPSPro.

The MIN algorithm performs worse because of the particular scheduling of the operations in the MM in ATLAS. This code scheduling incurs a register spill for each replacement and additional dependences. Figure 9-(a) shows the micro-MMM generated by ATLAS for an unroll of 4x2 that is the input to our compiler. The code in Figure 9-(b) is the resulting assembly code after our compiler does register allocation for the first few instructions. For the example we have assumed that we have only 6 FP registers. It can be seen that when register replacement starts (at statement 3), the farthest next use is that of the variable c1 that we have just computed. For this particular scheduling, the c variables always have the farthest next use. The registers holding the c variables are in dirty state and, as a result, their contents need to be written back to memory. This results in an increase in memory traffic because, at each register replacement, we have a register spill that generates one store. In addition to the spills, the compiler introduces additional dependences by allocating the same register to independent instructions (storage-related dependence). For instance, although all the madds are independent, due to the farthest-next-use rule of the MIN algorithm, we always spill the register $f4. Thus, we have created a chain of dependences along the instructions using register $f4, as shown in Figure 9-(b), and the performance of MIN decreases, as shown in Figure 8. We also tried our compiler using the CF heuristic, but the performance was even worse. The CF heuristic always replaces the registers containing the A and B values, which tend to be needed again shortly.

We then looked at the instruction scheduling in the MIPSPro assembly code. Since the MIPSPro assembly code was the result of instruction scheduling and register allocation, we extracted the scheduling of the madd instructions in the MIPSPro assembly code and obtained the code shown in Figure 9-(c). We ran our compiler on the code with the new scheduling to do the register allocation. The resulting assembly code is shown in Figure 9-(d). We executed this code and the performance obtained is the line MINSched in Figure 8. Now for unrolls larger than 4x6, our compiler behaves better. As Figure 9-(d) shows, with this new scheduling, the c variables have a higher reuse rate. Since the registers containing these variables are in dirty state, this new scheduling helps to reduce register spilling.

Table 4 helps to understand the results in Figure 8. For each degree of unroll, we show the number of lines of the assembly code (LOC), spills, and reloads.
Table 4 shows that MINSched always has fewer spills and reloads than MIPSPro. As the unroll grows and register pressure becomes more prominent, the fewer spills and reloads result in the better performance of MINSched (Figure 8). On average, for unrolls larger than 4x6, MINSched performs 10% better than MIPSPro. Finally, we also tried the CF algorithm with the new scheduling, but it performed worse than MIN, so we do not show results for it.

Unroll   Compiler   LOC    Spills   Reloads
2x4      MIPSPro    917    0        3
         MIN        921    0        0
         MINSched   921    0        0
4x4      MIPSPro    1684   48       64
         MIN        1577   0        0
         MINSched   1604   9        18
4x8      MIPSPro    4046   438      709
         MIN        4673   892      892
         MINSched   3607   356      362

Table 4. Characteristics of the MM code for different degrees of unroll.

We have used the long straight-line code in the MM in ATLAS as an example of where we apply register allocation. We have shown that MINSched performs better than MIPSPro for large degrees of unroll. However, this improvement is not useful. The reason is that the unroll that obtained the best performance corresponds to the largest unroll before register replacement starts (this point is 4x4 in Figure 8). As a result, ATLAS will select this unroll, where register replacement heuristics are not used.

(a) Unscheduled code
 1  c0 += a0 x b0
 2  c1 += a1 x b0
 3  c2 += a2 x b0
 4  c3 += a3 x b0
 5  c64 += a0 x b64
 6  c65 += a1 x b64
 7  c66 += a2 x b64
 8  c67 += a3 x b64
 9  c0 += a64 x b1
10  c1 += a65 x b1
11  c2 += a66 x b1
12  c3 += a67 x b1
13  c64 += a64 x b65
14  c65 += a65 x b65
15  c66 += a67 x b65
16  c67 += a68 x b65

(b) MIN algorithm on the unscheduled code
    load  $f0, 0($5)
    load  $f1, 0($6)
    load  $f2, 0($7)
 1  madd  $f2, $f2, $f0, $f1
    load  $f3, 8($5)
    load  $f4, 8($7)
 2  madd  $f4, $f4, $f3, $f1
    load  $f5, 16($5)
    store $f4, 8($7)      (first spill)
    load  $f4, 16($7)
 3  madd  $f4, $f4, $f5, $f1
    store $f4, 16($7)
    load  $f4, 24($5)
    store $f2, 0($7)
    load  $f2, 24($7)
 4  madd  $f2, $f2, $f4, $f1
    . . .

(c) Code scheduling applied by MIPSPro
    c0 += a0 x b0
    c0 += a64 x b1
    c1 += a65 x b1
    c1 += a1 x b0
    c2 += a2 x b0
    c3 += a3 x b0
    c2 += a66 x b1
    c3 += a67 x b1
    c66 += a2 x b64
    c64 += a0 x b64
    c66 += a67 x b65
    c64 += a64 x b65
    c65 += a1 x b64
    c67 += a3 x b64
    c67 += a68 x b65
    c65 += a65 x b65

(d) MIN algorithm after the scheduling applied by MIPSPro
    load  $f0, 0($5)
    load  $f1, 0($6)
    load  $f2, 0($7)
    madd  $f2, $f2, $f0, $f1
    load  $f3, 512($5)
    load  $f4, 8($6)
    madd  $f2, $f2, $f3, $f4
    load  $f5, 520($5)
    store $f2, 0($7)
    load  $f2, 8($7)
    madd  $f2, $f2, $f5, $f4
    load  $f5, 8($5)
    madd  $f2, $f2, $f5, $f1
    . . .

Figure 9. Two micro-MMM codes for a miniMMM of size 64, MU=4 and NU=2.

4.4 Analysis

Next we summarize our results. When the straight-line code is such that the number of simultaneously live values is smaller than the number of registers, there is no need to do register replacement. In that case, instruction scheduling is the dominant factor in optimization. This is the case for FFTs of size 4 and 8, as well as for MM, because it is possible to search for an unroll without register spilling or reloads.

When the number of simultaneously live values is larger than the number of registers, register replacement becomes important. In FFT, we observed that as register pressure increases, register allocation becomes more important than instruction scheduling.
For FFTs of size 32 and 64, where the number of spills and reloads is larger, register allocation becomes important, and our compiler achieves a higher performance. The higher performance of our compiler can also be due to the use of the SPIRAL scheduling together with the MIN algorithm, which results in an optimal register allocation for that scheduling.

On the other hand, for unrolls larger than 4x6, when register pressure was high for the MM code, the use of ATLAS scheduling and the MIN algorithm resulted in additional dependences. As a result, MIPSPro performed better than our compiler. It is unclear to us whether, by using the scheduling in ATLAS, we could have found an optimal register replacement able to beat the MIPSPro instruction scheduling. However, it is clear that there are schedules that can reduce register pressure, and these schedules should be used when register spills and reloads become important.

In summary, by using our compiler with the simple MIN algorithm we have beaten the performance obtained by the MIPSPro compiler in long straight-line code, when register pressure was high. Today's compilers like MIPSPro or GCC are not optimized to handle this type of code and, as a result, highly optimized code such as that produced by loop unrolling and trace scheduling could result in sub-optimal performance. Performance could be improved with an appropriate register allocator, and maybe an instruction schedule chosen to minimize register pressure.

5 Related Work

There are two different approaches to register allocation: global register allocation and local register allocation. Global register allocation assigns variables to registers throughout the program, while local register allocation assigns registers to variables within basic blocks.

The most commonly used register allocation is based on Graph Coloring [5]. Graph coloring is a global register allocator that tries to assign registers so that simultaneously live values are not assigned the same register. In this approach, the register allocation problem is translated into a graph coloring problem. Nodes in the graph represent live values that are candidates for register allocation. Edges connect live ranges that interfere, that is, that are simultaneously live at one or more points in the program. Coloring the graph consists of assigning colors (registers) to the nodes so that two nodes connected by an edge do not receive the same color. Coloring works well when the graph is colorable. However, when the number of values is larger than the number of registers, some registers need to be spilled. Since coloring and spilling is an NP-complete problem [9], heuristics are used to pick the register to spill [10, 4, 3, 5]. These heuristics work well in most programs because humans tend to write programs with small procedures and basic blocks where register spilling is unlikely to happen [14]. However, the work in [16] showed that the effectiveness of graph coloring is strongly affected as the size of the basic block increases.
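To make the interference-graph formulation above concrete, the sketch below builds an interference graph from live ranges and colors it greedily. It is a simplified stand-in for illustration, not Chaitin's algorithm [5] or any of the spill heuristics cited above; the function names, the (start, end) representation of a live range, and the example ranges are all assumptions.

```python
from typing import Dict, Optional, Set, Tuple


def interference(ranges: Dict[str, Tuple[int, int]]) -> Dict[str, Set[str]]:
    """Connect two values when their live ranges (start, end) overlap."""
    graph: Dict[str, Set[str]] = {v: set() for v in ranges}
    names = list(ranges)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (s1, e1), (s2, e2) = ranges[a], ranges[b]
            if s1 <= e2 and s2 <= e1:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def greedy_color(graph: Dict[str, Set[str]], k: int) -> Dict[str, Optional[int]]:
    """Give each node one of k colors (registers); None marks a spill candidate."""
    colors: Dict[str, Optional[int]] = {}
    # Visit the most constrained nodes first.
    for v in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {colors[n] for n in graph[v] if colors.get(n) is not None}
        free = [c for c in range(k) if c not in used]
        colors[v] = free[0] if free else None
    return colors


# Hypothetical live ranges: a, b, and c are live at the same time; d is not.
ranges = {"a": (1, 4), "b": (2, 5), "c": (3, 6), "d": (7, 8)}
print(greedy_color(interference(ranges), k=3))
print(greedy_color(interference(ranges), k=2))
```

With three registers every value receives a color; with two, one of the three mutually interfering values is left uncolored and would have to be spilled.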

As shown in this paper, however, compiler optimization can result in long basic blocks where register spilling becomes necessary. In those cases, local register allocation algorithms should perform better. Local register allocation is the task of assigning values to registers over an entire block of straight-line code so that the traffic between registers and memory is minimized. Belady's MIN [2] and Horwitz [13] are often treated as optimal algorithms for register allocation in a basic block. Belady's MIN [2] optimizes for the minimal number of register replacements, and not for the minimum number of loads and stores. As a result, it may not find the optimal solution. Horwitz's algorithm minimizes the number of loads and stores. Later algorithms [14, 15, 17] are mainly improvements to the compilation efficiency. However, they are still exponential in time and space. On the other hand, Belady's MIN algorithm runs in polynomial time, although it may not find the optimal solution.

Finally, Linear Scan [18, 19, 21] is a type of global register allocation that has recently attracted interest. It is a fast technique that results in efficient code and, as a result, can be used in dynamic compilation systems and "just-in-time" compilers. Linear Scan can be seen as an extension of local register allocation algorithms [7, 8, 14], which in turn derive from Belady's MIN.

The problem of register allocation and instruction scheduling in straight-line code has also been studied in the literature. In particular, Goodman [11] proposes two different scheduling algorithms: one tries to minimize pipeline stalls, and the other one tries to reduce register pressure. The algorithm is chosen based on the register pressure.
This agrees with our observation in Section 4.4.

6 Conclusion

In this paper, we have shown that a simple algorithm like Belady's MIN can beat the performance of state-of-the-art compilers like the MIPSPro or GCC compilers on long straight-line code. We have applied Belady's MIN algorithm to codes corresponding to FFTs and Matrix Multiplication that are produced by SPIRAL and ATLAS, respectively. We have measured the performance by running these codes on a real machine (a MIPS R12000 processor).

Our results show that Belady's MIN algorithm is about 12% and 33% faster for FFTs of size 32 and 64, respectively. In the case of Matrix Multiplication, it can also execute faster than the MIPSPro compiler by an average of 10%. However, in this application, the unroll that achieves the best performance is the one without register spilling. Our compiler and MIPSPro perform similarly using this unroll. Our experiments show that when the number of live variables is smaller than the number of registers, MIPSPro and our compiler have similar performance. However, as the number of live variables increases, register allocation seems to become more important. We believe that, in this case of high register pressure, instruction scheduling needs to be considered in concert with register allocation so that the number of register spills and reloads can be minimized.

References

[1] A. Aho, R. Sethi, and J. D. Ullman. Compilers, Principles, Techniques, and Tools. Addison-Wesley Publishing Company, 1985.
[2] L. Belady. A Study of Replacement Algorithms for a Virtual Storage Computer. IBM Systems Journal, 5(2):78–101, 1966.
[3] P. Bergner, P. Dahl, D. Engebretsen, and M. T. O'Keefe. Spill code minimization via interference region spilling. In SIGPLAN Conference on Programming Language Design and Implementation, pages 287–295, 1997.
[4] P. Briggs, K. Cooper, and L. Torczon. Improvements to Graph Coloring Register Allocation. ACM Transactions on Programming Languages and Systems, 6(3):428–455, May 1994.
[5] G. Chaitin. Register Allocation and Spilling Via Graph Coloring. In Proc. of the SIGPLAN 82 Symp. on Compiler Construction, pages 98–105, 1982.
[6] C. Fischer and T. LeBlanc. Crafting a Compiler. Benjamin Cummings, 1987.
[7] C. Fraser and D. Hanson. A Retargetable C Compiler: Design and Implementation. Benjamin/Cummings, Redwood City, CA, 1995.
[8] R. A. Freiburghouse. Register allocation via usage counts. Communications of the ACM, 17:638–642, November 1974.
[9] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York, 1989.
[10] L. George and L. Appel. Iterated Register Coalescing. ACM Transactions on Programming Languages and Systems, 18(3):300–324, May 1996.
[11] J. R. Goodman and W.-C. Hsu. Code scheduling and register allocation in large basic blocks. In Proceedings of the 2nd International Conference on Supercomputing, pages 442–452. ACM Press, 1988.
[12] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco, CA, 1996.
[13] L. Horwitz, R. M. Karp, R. E. Miller, and S. Winograd. Index Register Allocation. Journal of the ACM, 13(1):43–61, January 1966.
[14] W. Hsu, C. Fischer, and J. Goodman. On the Minimization of Load/Stores in Local Register Allocation. IEEE Transactions on Software Engineering, 15(10):1252–1260, October 1989.
[15] K. Kennedy. Index Register Allocation in Straight Line Code and Simple Loops. Design and Optimization of Compilers, Englewood Cliffs, NJ: Prentice Hall, 1972.
[16] J. R. Larus and P. N. Hilfinger. Register allocation in the SPUR Lisp compiler. In Proceedings of the 1986 SIGPLAN Symposium on Compiler Construction, pages 255–263. ACM Press, 1986.
[17] F. Luccio. A Comment on Index Register Allocation. Communications of the ACM, 10(9):572–574, 1967.
[18] M. Poletto, D. Engler, and M. Kaashoek. tcc: A system for fast, flexible and high-level dynamic code generation. In Proc. of the International Conference on Programming Language Design and Implementation, pages 109–121, 1997.
[19] M. Poletto and V. Sarkar. Linear scan register allocation. ACM Transactions on Programming Languages and Systems, 21:895–913, September 1999.

[20] M. Puschel, B. Singer, J. Xiong, J. Moura, D. Padua, M. Veloso, and R. Johnson. SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. To appear in Journal of High Performance Computing and Applications. http://www.ece.cmu.edu/~spiral.
[21] O. Traub, G. Holloway, and M. Smith. Quality and speed in linear-scan register allocation. In Proc. of the International Conference on Programming Language Design and Implementation, 1998.
[22] R. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software. Technical Report UT CS-97-366, LAPACK Working Note No. 131, University of Tennessee, 1997.
[23] J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A language and a compiler for DSP algorithms. In Proc. of the International Conference on Programming Language Design and Implementation, pages 298–308, 2001.
[24] K. Yotov, X. Li, G. Ren, M. Cibulskis, G. DeJong, M. Garzaran, D. Padua, K. Pingali, P. Stodghill, and P. Wu. A comparison of empirical and model-driven optimization. In Proc. of the International Conference on Programming Language Design and Implementation, pages 63–76, 2003.
