The Power of Belady's Algorithm in Register Allocation for Long ...


quently. In the next section, we explain the compiler we designed to perform the register allocation on the FFT and MM codes.

3 A Compiler for Long Straight-line Code

Belady’s MIN [2] is a replacement algorithm for pages in virtual memory. On a replacement, the page to replace is the one with the farthest next use. The MIN algorithm is optimal because it generates the minimal number of physical memory block replacements. However, in the context of virtual memory the MIN algorithm is impractical in most applications, because it is usually not known which memory block will be referenced in the future.

Belady’s MIN algorithm has been proposed for register allocation in long basic blocks, where the compiler knows exactly which values will be used in the future. In this context, the MIN algorithm is also known as Farthest First (FF) [6] since, on a register replacement, the register to replace first is the one holding the value with the farthest next use.

The MIN algorithm is not optimal for register allocation since the replacement decision is based only on the distance and not on whether the register has been modified. When a register holds a value that is not consistent with the value in memory, we say that the register is dirty. Otherwise, we say that the register is clean.
If the register to be replaced is dirty, the value of the register needs to be stored back to memory before a new value can be loaded into it. Thus, for a given instruction schedule the MIN algorithm guarantees the minimum number of register replacements, but it does not guarantee the minimum traffic with memory, that is, the minimum number of loads and stores. In our implementation, when there are several candidates for replacement with the same distance, our compiler chooses one in the clean state to avoid an extra store.

In order to further reduce the number of stores, another simple heuristic called Clean First (CF) was proposed in [6]. With this heuristic, when a live register needs to be replaced, CF first searches the clean registers. The clean register that contains the value with the farthest next use is chosen. If there is no clean register, the most distant dirty one is chosen.

We implemented a back-end compiler that uses the MIN and the CF algorithms for register allocation. Next, we describe the implementation details of our compiler.

3.1 Implementation Details

We built a simple compiler that translates high-level code into assembly code. Our compiler assumes that all the optimizations, including instruction scheduling, have been applied before the high-level code is generated. It only performs register allocation using Belady’s MIN or the CF heuristic explained above.

The compiler has two steps.
In the first step we transform the long straight-line code into static single-assignment (SSA) form and build the definition-use chain for all the variables. In the second step we do register allocation as shown in Figure 2. The second step closely follows the simple algorithm described in [1].

The algorithm can be implemented more efficiently. The register file can be a priority queue implemented with a binary heap, where higher priority is given to the farther next use. Operations such as extracting the register with the farthest next use can be executed in $O(\log R)$, where $R$ is the number of registers. So the time complexity of the MIN algorithm is $s \cdot O(\log R)$, where $s$ is the number of references to the variables in the program.

DATA STRUCTURES:
  register file:
    Array of registers. Each register r has 3 fields:
      - state: Indicates whether the register is clean.
      - var: The current variable in the register.
      - addr: The address of the variable in the register.
  next use list:
    List used to keep the stmt where each variable is used/defined.
    Each node in the list has 2 fields:
      - stmt: The statement number.
      - status: Definition or use.

FUNCTIONS:
  NEXT_USE (reg, stmt):
    Returns the statement number where reg.var is used next after stmt.
  MIN_REG_ALLOC (var, stmt):
    Returns the register that is going to be used for the variable var at stmt.

MAIN
  for each stmt in the program do begin
    for each v on RHS do begin
      reg ← MIN_REG_ALLOC (v, stmt)
      generate instruction "load reg, reg.addr"
    end
    for v on LHS do begin
      reg ← MIN_REG_ALLOC (v, stmt)
    end
    generate assembly instruction "op r1 r2 r3"
  end

MIN_REG_ALLOC (v, stmt)
  if v is in reg
    return reg
  farthestUse ← 0
  regToUse ← 0
  for each reg in register file do begin
    if reg is empty
      return reg
    if variable in reg is dead
      return reg
    nextUse ← NEXT_USE (reg, stmt)
    if farthestUse < nextUse
      farthestUse ← nextUse
      regToUse ← reg
    else if farthestUse = nextUse and reg.state = CLEAN
      regToUse ← reg
  end
  if regToUse.state = DIRTY
    generate register spill "store regToUse, regToUse.addr"
  return regToUse

Figure 2. Pseudocode for Belady’s MIN algorithm.
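The replacement rule of Figure 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Reg`, `next_use`, and `min_reg_alloc` are chosen here, the register file is a plain list scanned linearly (no heap), and the function only selects a register and reports which variable would have to be spilled, leaving the update of the register file to the caller. The sample data reproduces the snapshot of Table 1 at statement 5.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

INF = float("inf")  # next use of a dead value


@dataclass
class Reg:
    var: Optional[str] = None  # variable currently held, None if empty
    dirty: bool = False        # True if the value differs from memory


def next_use(uses: Dict[str, List[int]], var: str, stmt: int) -> float:
    """First statement after `stmt` that uses `var`, or INF if there is none."""
    return next((s for s in uses.get(var, []) if s > stmt), INF)


def min_reg_alloc(regs: List[Reg], uses: Dict[str, List[int]],
                  var: str, stmt: int) -> Tuple[int, Optional[str]]:
    """Pick a register for `var` at `stmt`.

    Returns (register index, variable to spill or None).
    """
    # 1. The variable is already in a register.
    for i, r in enumerate(regs):
        if r.var == var:
            return i, None
    # 2. An empty register or one holding a dead value is free to reuse.
    for i, r in enumerate(regs):
        if r.var is None or next_use(uses, r.var, stmt) == INF:
            return i, None
    # 3. Farthest next use wins; among equals, a clean register is preferred.
    victim = max(range(len(regs)),
                 key=lambda i: (next_use(uses, regs[i].var, stmt),
                                not regs[i].dirty))
    spilled = regs[victim].var if regs[victim].dirty else None
    return victim, spilled


# Snapshot of Table 1: register file and use lists at statement 5.
regs_at_5 = [Reg("t2", True), Reg("t3", True), Reg("t0", True),
             Reg("x(3)", False), Reg("x(7)", False), Reg("t1", True)]
uses = {"t2": [9, 11], "t3": [10, 12], "t0": [13, 15],
        "x(3)": [5, 7], "x(7)": [5, 7], "t1": [14, 16]}

choice = min_reg_alloc(regs_at_5, uses, "t4", 5)
print(choice)  # (5, 't1'): register index 5 ($f5) is chosen and t1 is spilled
```

The tuple used as the `max` key encodes the tie-break described in the text: the next-use distance is compared first, and the clean state only matters when two distances are equal.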

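The difference between MIN and the Clean First (CF) heuristic described in this section comes down to how the victim register is chosen. The sketch below contrasts the two rules. The function names, the tuple layout, and the sample register file are all hypothetical and serve only as illustration; they are not taken from the paper.

```python
from typing import List, Tuple

# (variable, dirty, next use) for each occupied register
RegState = Tuple[str, bool, int]


def min_victim(regs: List[RegState]) -> int:
    """MIN: farthest next use; the clean state only breaks ties."""
    return max(range(len(regs)), key=lambda i: (regs[i][2], not regs[i][1]))


def clean_first_victim(regs: List[RegState]) -> int:
    """CF: farthest next use among clean registers; dirty ones only if none is clean."""
    clean = [i for i, (_, dirty, _) in enumerate(regs) if not dirty]
    pool = clean if clean else range(len(regs))
    return max(pool, key=lambda i: regs[i][2])


# Hypothetical register file, not taken from the paper.
regs = [("a", True, 20), ("b", False, 8), ("c", False, 12), ("d", True, 30)]
print(min_victim(regs))          # 3: d has the farthest next use, but needs a store
print(clean_first_victim(regs))  # 2: c is the farthest clean value, no store needed
```

In this example CF avoids a store but evicts a value that is needed sooner (statement 12 instead of 30), which is the trade-off between the two heuristics.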

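The first compiler step, building the definition-use chain of every variable, can be illustrated on the size-4 FFT code of Figure 3-(a). The sketch below is an assumption-laden illustration rather than the authors' code: the helper names `build_du_chains` and `next_use` are chosen here, and the parsing relies on every statement having the simple form `lhs = a op b`.

```python
import re
from collections import defaultdict

# Statements of Figure 3-(a), in order; statement numbers start at 1.
program = [
    "t0 = x(1) - x(5)", "t1 = x(2) - x(6)", "t2 = x(1) + x(5)", "t3 = x(2) + x(6)",
    "t4 = x(3) - x(7)", "t5 = x(4) - x(8)", "t6 = x(3) + x(7)", "t7 = x(4) + x(8)",
    "y(5) = t2 - t6", "y(6) = t3 - t7", "y(1) = t2 + t6", "y(2) = t3 + t7",
    "y(7) = t0 - t5", "y(8) = t1 + t4", "y(3) = t0 + t5", "y(4) = t1 - t4",
]


def build_du_chains(stmts):
    """Map each variable to its ordered list of (statement, 'd' or 'u') entries."""
    chains = defaultdict(list)
    for num, stmt in enumerate(stmts, start=1):
        lhs, rhs = (side.strip() for side in stmt.split("="))
        # Operands look like "t3" or "x(5)"; operators are skipped by the pattern.
        for operand in re.findall(r"\w+(?:\(\d+\))?", rhs):
            chains[operand].append((num, "u"))
        chains[lhs].append((num, "d"))
    return dict(chains)


def next_use(chains, var, stmt):
    """Statement number of the first use of `var` after `stmt`, or None."""
    for num, kind in chains.get(var, []):
        if kind == "u" and num > stmt:
            return num
    return None


chains = build_du_chains(program)
print(chains["t1"])               # [(2, 'd'), (14, 'u'), (16, 'u')]
print(next_use(chains, "t1", 5))  # 14
```

Because the input is straight-line code in which each variable is defined once, a single forward pass is enough; no control-flow analysis is needed.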
(a)
   1  t0 = x(1) - x(5)
   2  t1 = x(2) - x(6)
   3  t2 = x(1) + x(5)
   4  t3 = x(2) + x(6)
   5  t4 = x(3) - x(7)      <- first spill
   6  t5 = x(4) - x(8)
   7  t6 = x(3) + x(7)
   8  t7 = x(4) + x(8)
   9  y(5) = t2 - t6
  10  y(6) = t3 - t7
  11  y(1) = t2 + t6
  12  y(2) = t3 + t7
  13  y(7) = t0 - t5
  14  y(8) = t1 + t4
  15  y(3) = t0 + t5
  16  y(4) = t1 - t4

(b)
  load  $f0, 0($5)
  load  $f1, 32($5)
  sub   $f2, $f0, $f1
  load  $f3, 8($5)
  load  $f4, 40($5)
  sub   $f5, $f3, $f4
  add   $f0, $f0, $f1
  add   $f1, $f3, $f4
  load  $f3, 16($5)
  load  $f4, 48($5)
  store $f5, 24($sp)        <- first spill
  sub   $f5, $f3, $f4
  store $f5, 48($sp)
  load  $f5, 24($5)
  store $f2, 16($sp)
  load  $f2, 56($5)
  ……

Figure 3. An example of register allocation using Belady’s MIN algorithm. (a) Source code for FFT of size 4. (b) The resulting assembly code using Belady’s MIN algorithm.

Figure 3 gives an example of how our compiler works using the MIN algorithm. The code in Figure 3-(a) is the code for the FFT of size 4. It has been generated using SPIRAL [20, 23]. Suppose the number of Floating Point (FP) registers is 6. At statement 5, a register needs to be replaced for the first time. Table 1 gives a snapshot of the data structures at statement 5. The variable t1 has the farthest next use. As a result, register $f5 is replaced. Since $f5 is dirty, there is a register spill.
Figure 3-(b) shows the assembly code after doing register allocation for the code in Figure 3-(a).

(a) Register file
  Reg   Current variable   State   Addr
  $f0   t2 (was x(1))      dirty   40($sp)
  $f1   t3 (was x(5))      dirty   32($sp)
  $f2   t0                 dirty   16($sp)
  $f3   x(3) (was x(2))    clean   16($5)
  $f4   x(7) (was x(6))    clean   48($5)
  $f5   t1                 dirty   24($sp)

(b) Du-chain and next use
  Variable   Du-chain               Next use
  t2         (3,d) (9,u) (11,u)     9
  t3         (4,d) (10,u) (12,u)    10
  t0         (1,d) (13,u) (15,u)    13
  x(3)       (5,u) (7,u)            7
  x(7)       (5,u) (7,u)            7
  t1         (2,d) (14,u) (16,u)    14

Table 1. Important data structures used in the MIN algorithm. (a) The register file. (b) The du-chain and next use at statement 5.

Notice that our compiler follows the instruction scheduling embedded in the source code; that is, our compiler schedules the arithmetic operations in the same order as they appear in the source code. Also, while most compilers do aggressive scheduling for load operations, ours does not. Compilers try to hoist loads a significant distance above the point where the register is used so that cache latency can be hidden on a miss. However, our compiler loads values into the registers immediately before they are used. Although load hoisting can increase register pressure, some loads could be moved ahead of their use without increasing register pressure. This would simply require an additional pass in our compiler. However, we have not implemented that.

We placed each long basic block in a procedure in a separate file, and then we call our compiler to do the register allocation and generate the assembly code.
When the generated code uses registers that contain values from outside of the procedure, we save them at the beginning of the procedure and restore them at the end.

4 Evaluation

4.1 Environmental Setup

In this section, we compare our compiler against the GCC and MIPSPro compilers and evaluate how well they behave on long straight-line codes. For the evaluation, we have used the already optimized unrolled codes obtained from SPIRAL and ATLAS. SPIRAL produces Fortran code, while ATLAS produces C code. Table 2 shows the versions and flags that we used for the MIPSPro and GCC compilers. Our compiler is the one described in Section 3.1 that implements the MIN and CF algorithms. Remember that our compiler schedules operations in the same order as they appear in the source code generated by ATLAS and SPIRAL. Both ATLAS and SPIRAL perform some kind of instruction scheduling in the source code. MIPSPro and GCC, however, rearrange the SPIRAL or FORTRAN code. As a result, their instruction scheduling is different from ours.

  Appl.                   Compiler   Version     Flags
  SPIRAL (Fortran code)   MIPSPro    7.3.2.1.m   -OPT:Olimit=0 -O3
                          G77        3.2         -O3
  ATLAS (C code)          MIPSPro    7.3.1.1m    -O3 -64 -OPT:Olimit=15000 -TARG:platform=IP27 -LNO:blocking=OFF -LOPT:alias=typed
                          GCC        3.2         -fomit-frame-pointer -O3

Table 2. Compiler Version and Flags for MIPSPro and GCC.

All the experiments were done on a MIPS R12000 processor with a frequency of 270 MHz.
The machine has 32 floating point registers, an L1 Instruction Cache of 32 KB, and an L1 Data Cache of 32 KB. In all the experiments that use MIPSPro and our compiler, the code fits into the L1 Instruction Cache, just as the data fit into the L1 Data Cache. However, in a few cases where the code was compiled with GCC, it did not fit into the instruction cache (we point this out in the evaluation in the next section). Finally, notice that integer registers are not a problem in the FFT or MM codes since the machine has 32 registers and we only need a few of them.

Next, we study the effectiveness of our compiler on the long straight-line code of FFT (Section 4.2) and MM (Section 4.3). Finally, in Section 4.4, we summarize our results.

4.2 FFT

The FFT code that we use is the code generated by the SPIRAL compiler. In this section, we first study the characteristics of the FFT code generated by the SPIRAL compiler (Section 4.2.1) and then we evaluate the performance (Section 4.2.2).

4.2.1 SPIRAL and FFT Code

SPIRAL translates formulas representing signal processing transforms into efficient Fortran programs. It uses intelligent search strategies to automatically generate optimized DSP libraries. In the case of FFT, SPIRAL first searches for a good implementation for small-size transforms, 2 to 64, and then searches for a good implementation for larger size transforms that use the small-size results.
For FFT sizes smaller than 64, SPIRAL found that straight-line code achieves the best performance since loop control overheads are eliminated and all the temporary variables can be scalars.

To better understand the performance results, we first study the patterns that appear in the FFT code. Some of these patterns are due to the way SPIRAL generates code, while others are intrinsic to the nature of the FFT. Patterns that come from SPIRAL are: 1) Each variable is defined only once; that is, every variable holds only one value during its lifetime. For that, SPIRAL uses renaming. 2) If a variable has two uses, at most one statement is between the two uses of the variable. This means that the two uses of a variable are very close to each other. On the other hand, patterns that are intrinsic to FFT are: 3) Each variable is used twice at most. 4) If two variables appear in a pair on the RHS of an expression, then they always appear in a pair, and they appear twice. Figure 3-(a) shows an example of FFT code of size 4 where these patterns can be seen. In that example, array x is the input, array y is the output, and the t variables are temporaries. As can be seen, each element of the input array x has two uses. The uses of the x() or t variables always appear in pairs, and there is only one statement between these two uses.

This FFT code generated by SPIRAL is used as the input for our compiler.
Therefore, given the proximity of the two uses of each variable in the SPIRAL code, any compiler would minimize register replacements by keeping the variable in the same register during the two uses. As a result, the two uses of a variable can be considered as a single use. Thus, the problem of register allocation for the FFT code generated by SPIRAL can be simplified to the problem of register allocation where each variable is defined once and used once.

[Figure 4: plot of pseudo MFlops (5N log2 N / time in microseconds) versus FFT size (4, 8, 16, 32, 64) for MIN, MIPSPro, and G77.]

Figure 4. Performance of the best formula for FFTs 4 - 64.

Based on this simplified model, register replacement only occurs between the definition and the use of the variable. One consequence is that the MIN and CF algorithms behave similarly and always choose the same register to replace. In addition, since the MIN algorithm implemented in our compiler is known to produce the minimum number of register replacements, we can claim that for the FFT problem and given the SPIRAL scheduling, our compiler generates the optimal solution, the one with the minimum number of loads and stores.
In the next section we evaluate the performance differences between this optimal solution and the MIPSPro or G77 compilers.

4.2.2 Performance Evaluation

  Size   Backend   LOC    spills   reloads
  4      MIPSPro   34     0        0
         MIN       34     0        0
  8      MIPSPro   90     0        0
         MIN       95     0        0
  16     MIPSPro   266    9        14
         MIN       276    2        2
  32     MIPSPro   921    150      212
         MIN       764    34       34
  64     MIPSPro   2468   552      606
         MIN       1944   112      112

Table 3. Characteristics of the code that MIPSPro and MIN generate for FFTs 4-64.

SPIRAL does an exhaustive search to find the fastest formula for a given platform. We studied the performance obtained by the best formula when the code was compiled using the MIPSPro compiler, G77, or our compiler. Figure 4 shows the best performance obtained for FFTs of size 4 - 64 using the MIPSPro compiler (MIPSPro), the G77 compiler (G77), or our compiler (MIN). The performance is measured in terms of "pseudo MFlops", which is the value computed by the equation $5N\log_2 N/t$, where $N$ is the size of the FFT and $t$ is the execution time in microseconds. Notice that the formula that achieves the best performance can be different in each case.

We focus on MIPSPro and MIN, since G77 is always much slower. To help understand the results, Table 3 shows the number of lines of the assembly code (LOC), spills, and reloads for each point in Figure 4. A spill is a store of a value that needs to be loaded again later. A reload is a load of a value that previously was in a register. The data in Table 3 show that, using MIN, the number of spills and reloads is always the same. This is due to the SPIRAL scheduling. As Table 3 shows, for FFTs of size 4 and 8, the 32 FP registers are enough to hold the values in the program, and as a result there is no register replacement. Thus, the difference in performance between MIPSPro and MIN comes from the differences in instruction scheduling. From FFTs of size 16, we start to see some spills and reloads, and MIN overcomes the effects of instruction scheduling and obtains the same performance as MIPSPro. Finally, for FFTs of size 32 and 64, since the amount of spilling is larger, the effect of instruction scheduling becomes less important, and MIN outperforms MIPSPro.
MIN performs 12% and 33% better than MIPSPro for FFTs of size 32 and 64, respectively.

[Figure 5: plot of execution time in microseconds (y axis, 3 to 7) for the different formulas for FFT of size 64 (x axis, 1 to 176), for MIPSPro and MIN.]

Figure 5. Performance of the different formulas for FFTs of size 64.

In Figure 5, we show the execution time of several FFT codes of size 64 that SPIRAL produced using different formulas. For each formula we show two points. One corresponds to the performance obtained when the SPIRAL code for that formula was compiled using the MIPSPro compiler (MIPSPro), and the other to the performance obtained using our compiler (MIN). On average, MIN runs 18% faster than MIPSPro. In addition, the figure shows that our compiler always performs better. As before, it seems to us that as register pressure increases, register allocation becomes the dominant factor.

4.3 Matrix Multiplication

In this section, we study the performance of register allocation for the matrix multiplication code produced by ATLAS. We first describe ATLAS (Section 4.3.1) and then present the performance evaluation (Section 4.3.2).

4.3.1 Overview of ATLAS

ATLAS is an empirical optimizer whose structure is shown in Figure 6. ATLAS is composed of i) a Search Engine that performs an empirical search over the values of certain optimization parameters and ii) a Code Generator that generates C code given these values.
The generated C code is compiled and executed, and its performance is measured. The system keeps track of the values that produced the best performance, which will be used to generate a highly tuned library.

[Figure 6: block diagram. Blocks: ATLAS Search Engine, ATLAS MM Code Generator, Compiler, Execute & Measure. Labels: Tile Size, Unroll values, Latency, Muladd, Fetch, MiniMMM C code, MFLOPS.]

Figure 6. Empirical optimizer in ATLAS.

For the search process, ATLAS generates a matrix multiplication of size Tile Size that we call MiniMMM. The code for the MiniMMM is itself tiled to make better use of the registers. Each of these small matrix multiplications multiplies a MUx1 sub-matrix of A with a 1xNU sub-matrix of B, and accumulates the result in a MUxNU sub-matrix of C. We call these micro-MMMs. Figure 7 shows a pictorial view and the pseudo-code corresponding to the mini-MMM after register tiling and unrolling of the micro-MMM. The code of a micro-MMM is unrolled to produce straight-line code. After the register tiling, the K loop in Figure 7 is unrolled by a factor KU. The result is a straight-line code that contains KU copies of the micro-MMM code. Figure 9-(a) shows two copies of the micro-MMM that corresponds to the unrolls MU=4 and NU=2 shown in Figure 7.
Notice that the degree of unroll MUxNU determines the number of FP registers required. This number is MUxNU to hold the C values, MU to hold the A values, and NU to hold the B values.

[Figure 7: diagram of the mini-MMM tiling. Labels: MU, NU, K, A, B, C, Tile Size.]

for (int j =0; j


For this experiment the rest of the parameters of the miniMMM have been set to the values that ATLAS found to be optimal [24]. In particular, TileSize and KU have been set to 64 (see footnote 1); that is, the innermost k loop in Figure 7 is totally unrolled. Notice that for the order that ATLAS uses for register blocking, while unrolling along the k dimension does reduce loop overheads, it does not increase register pressure.

[Figure 8: plot of MFlops (y axis, 50 to 500) versus degree of unroll (x axis: 2x2, 2x4, 4x2, 4x4, 4x5, 5x4, 4x6, 6x4, 5x5, 5x6, 6x5, 4x8, 8x4, 6x6, 6x8, 8x6, 7x7, 7x8, 8x7, 8x8) for MIPSPro, MIN, MINSched, and GCC.]

Figure 8. Performance versus MUxNU unroll for the mini-MMM.

Figure 8 shows that, as before, GCC has the worst performance. In particular, the sharp drop for unrolls 6x8 and larger is due to the size of the code, which overflows the 32KB instruction cache of the MIPS R12000 processor. We now focus on MIPSPro. Figure 8 also shows that MIN behaves almost like MIPSPro when MU and NU are small and, as a result, there is no register replacement. Thus, the slightly better performance of MIN is mostly due to differences in instruction scheduling. Register replacement occurs for unrolls 4x4 and larger in MIPSPro, and unrolls 4x5 and larger in MIN.
Unfortunately, for unrolls 4x6, where there are more values than registers and register replacement occurs, MIN performs worse than MIPSPro.

The MIN algorithm performs worse because of the particular scheduling of the operations in the MM in ATLAS. This particular code scheduling incurs a register spill for each replacement, as well as additional dependences. Figure 9-(a) shows the micro-MMM generated by ATLAS for an unroll of 4x2 that is the input to our compiler. The code in Figure 9-(b) is the resulting assembly code after our compiler does register allocation for the first few instructions. For the example we have assumed that we have only 6 FP registers. It can be seen that when register replacement starts (at statement 3), the farthest next use is that of the variable c1 that we have just computed. For this particular scheduling, the c variables always have the farthest next use. The registers holding the c variables are in the dirty state and, as a result, their contents need to be written back to memory. This results in an increase in memory traffic because, at each register replacement, we have a register spill that generates one store. In addition to the spills, the compiler introduces additional dependences by allocating the same register to independent instructions (storage-related dependence).
For instance, although all the madds are independent, due to the farthest-next-use rule of the MIN algorithm, we always spill the register $f4. Thus, we have created a chain of dependences along the instructions using register $f4, as shown in Figure 9-(b). Thus, the performance of MIN decreases, as shown in Figure 8. We also tried our compiler using the CF heuristic, but the performance was even worse. The CF heuristic always replaces the registers containing the A and B values, which tend to be needed again shortly.

Footnote 1: ATLAS only tries square tiles.

We then looked at the instruction scheduling in the MIPSPro assembly code. Since the MIPSPro assembly code was the result of instruction scheduling and register allocation, we extracted the scheduling of the madd instructions in the MIPSPro assembly code and obtained the code shown in Figure 9-(c). We ran our compiler on the code with the new scheduling to do the register allocation. The resulting assembly code is shown in Figure 9-(d). We executed this code, and the performance obtained is the line MINSched in Figure 8. Now for unrolls larger than 4x6, our compiler behaves better. As Figure 9-(d) shows, with this new scheduling, the c variables have a higher reuse rate.
Since the registers containing these variables are in the dirty state, this new scheduling helps to reduce register spilling.

Table 4 helps to understand the results in Figure 8. For each degree of unroll, we show the number of lines of the assembly code (LOC), spills, and reloads. Table 4 shows that MINSched never has more spills or reloads than MIPSPro, and has fewer whenever MIPSPro has any. As the unroll grows and register pressure becomes more prominent, the fewer spills and reloads result in the better performance of MINSched (Figure 8). On average, for unrolls larger than 4x6, MINSched performs 10% better than MIPSPro. Finally, we also tried the CF algorithm with the new scheduling, but it performed worse than MIN, so we do not show results for it.

  Unroll   Compiler   LOC    spills   reloads
  2x4      MIPSPro    917    0        3
           MIN        921    0        0
           MINSched   921    0        0
  4x4      MIPSPro    1684   48       64
           MIN        1577   0        0
           MINSched   1604   9        18
  4x8      MIPSPro    4046   438      709
           MIN        4673   892      892
           MINSched   3607   356      362

Table 4. Characteristics of the MM code for different degrees of unroll.

We have used the long straight-line code in the MM in ATLAS as an example of where we apply register allocation. We have shown that MINSched performs better than MIPSPro for large degrees of unroll. However, this improvement is not useful.
The reason is that the unroll that obtained the best performance corresponds to the largest unroll before register replacement starts (this point is 4x4 in Figure 8). As a result, ATLAS will select this unroll, where register replacement heuristics are not used.
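The interaction between MIN, dirty registers, and spill counts discussed above can be illustrated with a small simulation. The following is a minimal sketch, not the compiler used in this paper: it assumes every instruction is a multiply-add of the form dest += a x b, that the destination becomes dirty after each madd, and that only stores caused by evicting a dirty register are counted (the final write-back at the end of the block is ignored). The names `Madd`, `next_use`, `min_allocate`, and `FIG9A` are hypothetical and introduced only for illustration; `FIG9A` transcribes the instruction order of Figure 9-(a).

```python
from typing import Dict, List, Tuple

# One multiply-add "dest += a x b", as in Figure 9-(a): (dest, a, b).
Madd = Tuple[str, str, str]


def next_use(code: List[Madd], start: int, value: str) -> int:
    """Index of the first instruction at or after `start` that touches `value`."""
    for i in range(start, len(code)):
        if value in code[i]:
            return i
    return len(code)  # never used again: farther than any real use


def min_allocate(code: List[Madd], num_regs: int) -> Tuple[int, int]:
    """Simulate MIN with `num_regs` registers; return (loads, spill stores)."""
    if num_regs < 3:
        raise ValueError("a madd needs at least three registers")
    dirty: Dict[str, bool] = {}  # value held in a register -> dirty flag
    loads = stores = 0
    for i, instr in enumerate(code):
        needed = set(instr)
        for value in sorted(needed):  # sorted only to make the run deterministic
            if value in dirty:
                continue  # already in a register
            if len(dirty) == num_regs:
                # Evict the value with the farthest next use, never an operand
                # of the current instruction. Dirtiness is NOT considered,
                # which is exactly the weakness of plain MIN.
                candidates = [v for v in dirty if v not in needed]
                victim = max(candidates, key=lambda v: next_use(code, i + 1, v))
                if dirty.pop(victim):
                    stores += 1  # evicting a dirty register costs a store
            dirty[value] = False
            loads += 1
        dirty[instr[0]] = True  # the accumulator was modified
    return loads, stores


# Instruction order of Figure 9-(a), transcribed as (dest, a, b) triples.
FIG9A: List[Madd] = [
    ("c0", "a0", "b0"), ("c1", "a1", "b0"), ("c2", "a2", "b0"), ("c3", "a3", "b0"),
    ("c64", "a0", "b64"), ("c65", "a1", "b64"), ("c66", "a2", "b64"), ("c67", "a3", "b64"),
    ("c0", "a64", "b1"), ("c1", "a65", "b1"), ("c2", "a66", "b1"), ("c3", "a67", "b1"),
    ("c64", "a64", "b65"), ("c65", "a65", "b65"), ("c66", "a67", "b65"), ("c67", "a68", "b65"),
]

for regs in (6, 8, 32):
    print(regs, min_allocate(FIG9A, regs))
```

Feeding the same function a different instruction order over the same values is one way to explore how scheduling changes the number of loads and spill stores for a fixed replacement policy. The counts produced by this sketch are not expected to match Table 4, which reports full ATLAS kernels.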


(a) Unscheduled code

 1  c0  += a0  x b0
 2  c1  += a1  x b0
 3  c2  += a2  x b0
 4  c3  += a3  x b0
 5  c64 += a0  x b64
 6  c65 += a1  x b64
 7  c66 += a2  x b64
 8  c67 += a3  x b64
 9  c0  += a64 x b1
10  c1  += a65 x b1
11  c2  += a66 x b1
12  c3  += a67 x b1
13  c64 += a64 x b65
14  c65 += a65 x b65
15  c66 += a67 x b65
16  c67 += a68 x b65

(b) MIN algorithm on the unscheduled code

    load  $f0, 0($5)
    load  $f1, 0($6)
    load  $f2, 0($7)
 1  madd  $f2, $f2, $f0, $f1
    load  $f3, 8($5)
    load  $f4, 8($7)
 2  madd  $f4, $f4, $f3, $f1
    load  $f5, 16($5)
    store $f4, 8($7)        (first spill)
    load  $f4, 16($7)
 3  madd  $f4, $f4, $f5, $f1
    store $f4, 16($7)
    load  $f4, 24($5)
    store $f2, 0($7)
    load  $f2, 24($7)
 4  madd  $f2, $f2, $f4, $f1
    . . .

(c) Code scheduling applied by MIPSPro

    c0  += a0  x b0
    c0  += a64 x b1
    c1  += a65 x b1
    c1  += a1  x b0
    c2  += a2  x b0
    c3  += a3  x b0
    c2  += a66 x b1
    c3  += a67 x b1
    c66 += a2  x b64
    c64 += a0  x b64
    c66 += a67 x b65
    c64 += a64 x b65
    c65 += a1  x b64
    c67 += a3  x b64
    c67 += a68 x b65
    c65 += a65 x b65

(d) MIN algorithm after the scheduling applied by MIPSPro

    load  $f0, 0($5)
    load  $f1, 0($6)
    load  $f2, 0($7)
    madd  $f2, $f2, $f0, $f1
    load  $f3, 512($5)
    load  $f4, 8($6)
    madd  $f2, $f2, $f3, $f4
    load  $f5, 520($5)
    store $f2, 0($7)
    load  $f2, 8($7)
    madd  $f2, $f2, $f5, $f4
    load  $f5, 8($5)
    madd  $f2, $f2, $f5, $f1
    . . .

Figure 9. Two micro-MMM codes for a miniMMM of size 64, MU=4 and NU=2.

4.4 Analysis

Next we summarize our results.
When the straight-line code is such that the number of simultaneously live values is smaller than the number of registers, there is no need to do register replacement. In that case, instruction scheduling is the dominant factor in optimization. This is the case for FFTs of size 4 and 8, as well as for MM, because it is possible to search for an unroll without register spilling or reloads.

When the number of simultaneously live values is larger than the number of registers, register replacement becomes important. In FFT, we observed that as register pressure increases, register allocation becomes more important than instruction scheduling. For FFTs of size 32 and 64, where the number of spills and reloads is larger, register allocation becomes important, and our compiler achieves a higher performance. The higher performance of our compiler may also be due to the use of the SPIRAL scheduling together with the MIN algorithm, which results in an optimal register allocation for that scheduling.

On the other hand, for unrolls larger than 4x6, when register pressure was high for the MM code, the use of ATLAS scheduling and the MIN algorithm resulted in additional dependences. As a result, MIPSPro performed better than our compiler.
It is unclear to us whether, by using the scheduling in ATLAS, we could have found an optimal register replacement able to beat the MIPSPro instruction scheduling. However, it is clear that there are schedulings that can reduce register pressure, and these schedules should be used when register spills and reloads become important.

In summary, by using our compiler with the simple MIN algorithm we have beaten the performance obtained by the MIPSPro compiler in long straight-line code when register pressure was high. Today’s compilers, like MIPSPro or GCC, are not optimized to handle this type of code and, as a result, highly optimized code such as that produced by loop unrolling and trace scheduling could result in sub-optimal performance. Performance could be improved with an appropriate register allocator, and perhaps an instruction schedule chosen to minimize register pressure.

5 Related Work

There are two different approaches to register allocation: global register allocation and local register allocation. Global register allocation assigns variables to registers throughout the program, while local register allocation assigns registers to variables within basic blocks.

The most commonly used register allocation is based on Graph Coloring [5]. Graph coloring is a global register allocator that tries to assign registers so that simultaneously live values are not assigned the same register. In this approach, the register allocation problem is translated into a graph coloring problem.
Nodes in the graph represent live values that are candidates for register allocation. Edges connect live ranges that interfere, that is, that are simultaneously live at one or more points in the program. Coloring the graph consists of assigning colors (registers) to the nodes so that two nodes connected by an edge do not receive the same color. Coloring works well when the graph is colorable. However, when the number of values is larger than the number of registers, some registers need to be spilled. Since coloring and spilling is an NP-complete problem [9], heuristics are used to pick the register to spill [10, 4, 3, 5]. These heuristics work well in most programs because humans tend to write programs with small procedures and basic blocks where register spilling is unlikely to happen [14]. However, the work in [16] showed that the effectiveness of graph coloring is strongly affected as the size of the basic block increases.
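The translation from live ranges to a coloring problem described above can be sketched in a few lines. This is an illustrative sketch only: it uses a plain greedy coloring in decreasing-degree order, not the simplify/select procedure of Chaitin [5] or any of the spill heuristics cited above, and it assumes each value has a single live interval (first, last) of instruction indices within one block. The names `build_interference` and `greedy_color` are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Interval = Tuple[int, int]  # (first, last) instruction index where the value is live


def build_interference(live: Dict[str, Interval]) -> Dict[str, Set[str]]:
    """Connect two values when their live intervals overlap at some point."""
    graph: Dict[str, Set[str]] = {v: set() for v in live}
    names = list(live)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            (s1, e1), (s2, e2) = live[u], live[v]
            if s1 <= e2 and s2 <= e1:  # closed intervals overlap
                graph[u].add(v)
                graph[v].add(u)
    return graph


def greedy_color(graph: Dict[str, Set[str]], num_regs: int) -> Dict[str, Optional[int]]:
    """Give each node a register 0..num_regs-1, or None if it must be spilled."""
    color: Dict[str, Optional[int]] = {}
    # Visit the most constrained nodes first (highest degree).
    for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {color[n] for n in graph[node] if color.get(n) is not None}
        free = [c for c in range(num_regs) if c not in used]
        color[node] = free[0] if free else None  # None marks a spill candidate
    return color


live_ranges = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (5, 6)}
interference = build_interference(live_ranges)
print(greedy_color(interference, 3))
print(greedy_color(interference, 2))
```

In the example, values a, b, and c are all live at instruction 2, so they form a triangle in the interference graph and cannot share two registers; value d is live later and can reuse any register.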


However, as shown in this paper, compiler optimization can result in long basic blocks where register spilling becomes necessary. In those cases, local register allocation algorithms should perform better. Local register allocation is the task of assigning values to registers over an entire block of straight-line code so that the traffic between registers and memory is minimized. Belady’s MIN [2] and Horwitz [13] are often treated as optimal algorithms for register allocation in a basic block. Belady’s MIN [2] optimizes for the minimal number of register replacements, and not for the minimum number of loads/stores. As a result, it may not find the optimal solution. Horwitz’s algorithm minimizes the number of loads and stores. Later algorithms [14, 15, 17] are mainly improvements to the compilation efficiency. However, they are still exponential in time and space. On the other hand, Belady’s MIN algorithm runs in polynomial time, although it may not find the optimal solution.

Finally, Linear Scan [18, 19, 21] is a type of global register allocation that has become of interest lately.
Its appeal is that it is a fast technique that results in efficient code and, as a result, can be used in dynamic compilation systems and “just-in-time” compilers. Linear Scan can be seen as an extension of local register allocation algorithms [7, 8, 14], which in turn derive from Belady’s MIN.

The problem of register allocation and instruction scheduling in straight-line code has also been studied in the literature. In particular, Goodman [11] proposes two different scheduling algorithms: one tries to minimize pipeline stalls, and the other one tries to reduce register pressure. The algorithm is chosen based on the register pressure. This agrees with our observation in section 4.4.

6 Conclusion

In this paper, we have shown that a simple algorithm like Belady’s MIN can beat the performance of state-of-the-art compilers like the MIPSPro or GCC compilers in long straight-line codes. We have applied Belady’s MIN algorithm to codes corresponding to FFT transforms and Matrix Multiplication that are produced by SPIRAL and ATLAS, respectively. We have measured the performance by running these codes on a real machine (a MIPS R12000 processor).

Our results show that Belady’s MIN algorithm is about 12% and 33% faster for FFTs of size 32 and 64, respectively. In the case of Matrix Multiplication, it can also execute faster than the MIPSPro compiler by an average of 10%.
However, in this application, the unroll that achieves the best performance is the one without register spilling. Our compiler and MIPSPro perform similarly using this unroll. Our experiments show that when the number of live variables is smaller than the number of registers, MIPSPro and our compiler have similar performance. However, as the number of live variables increases, register allocation seems to become more important. We believe that, in this case of high register pressure, instruction scheduling needs to be considered in concert with register allocation so that the number of register spills and reloads can be minimized.

References

[1] A. Aho, R. Sethi, and J. D. Ullman. Compilers, Principles, Techniques, and Tools. Addison-Wesley Publishing Company, 1985.
[2] L. Belady. A Study of Replacement Algorithms for a Virtual Storage Computer. IBM Systems Journal, 5(2):78–101, 1966.
[3] P. Bergner, P. Dahl, D. Engebretsen, and M. T. O’Keefe. Spill code minimization via interference region spilling. In SIGPLAN Conference on Programming Language Design and Implementation, pages 287–295, 1997.
[4] P. Briggs, K. Cooper, and L. Torczon. Improvements to Graph Coloring Register Allocation. ACM Transactions on Programming Languages and Systems, 16(3):428–455, May 1994.
[5] G. Chaitin. Register Allocation and Spilling Via Graph Coloring. In Proc. of the SIGPLAN 82 Symp. on Compiler Construction, pages 98–105, 1982.
[6] C. Fischer and T. LeBlanc. Crafting a Compiler. Benjamin Cummings, 1987.
[7] C. Fraser and D. Hanson. A Retargetable C Compiler: Design and Implementation. Benjamin/Cummings, Redwood City, CA, 1995.
[8] R. A. Freiburghouse. Register allocation via usage counts. Communications of the ACM, 17:638–642, November 1974.
[9] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York, 1989.
[10] L. George and A. Appel. Iterated Register Coalescing. ACM Transactions on Programming Languages and Systems, 18(3):300–324, May 1996.
[11] J. R. Goodman and W.-C. Hsu. Code scheduling and register allocation in large basic blocks. In Proceedings of the 2nd international conference on Supercomputing, pages 442–452. ACM Press, 1988.
[12] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco, CA, 1996.
[13] L. Horwitz, R. M. Karp, R. E. Miller, and S. Winograd. Index Register Allocation. Journal of the ACM, 13(1):43–61, January 1966.
[14] W. Hsu, C. Fischer, and J. Goodman. On the Minimization of Load/Stores in Local Register Allocation. IEEE Transactions on Software Engineering, 15(10):1252–1260, October 1989.
[15] K. Kennedy. Index Register Allocation in Straight Line Code and Simple Loops. Design and Optimization of Compilers, Englewood Cliffs, NJ: Prentice Hall, 1972.
[16] J. R. Larus and P. N. Hilfinger. Register allocation in the SPUR Lisp compiler. In Proceedings of the 1986 SIGPLAN symposium on Compiler construction, pages 255–263. ACM Press, 1986.
[17] F. Luccio. A Comment on Index Register Allocation. Communications of the ACM, 10(9):572–574, 1967.
[18] M. Poletto, D. Engler, and M. Kaashoek. tcc: A system for fast, flexible and high-level dynamic code generation. In Proc. of the International Conference on Programming Language Design and Implementation, pages 109–121, 1997.
[19] M. Poletto and V. Sarkar. Linear scan register allocation. ACM Transactions on Programming Languages and Systems, 21:895–913, September 1999.


[20] M. Puschel, B. Singer, J. Xiong, J. Moura, D. Padua, M. Veloso, and R. Johnson. SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. To appear in Journal of High Performance Computing and Applications. http://www.ece.cmu.edu/~spiral.
[21] O. Traub, G. Holloway, and M. Smith. Quality and speed in linear-scan register allocation. In Proc. of the International Conference on Programming Language Design and Implementation, 1998.
[22] R. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software. Technical Report UT CS-97-366, LAPACK Working Note No. 131, University of Tennessee, 1997.
[23] J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A language and a compiler for DSP algorithms. In Proc. of the International Conference on Programming Language Design and Implementation, pages 298–308, 2001.
[24] K. Yotov, X. Li, G. Ren, M. Cibulskis, G. DeJong, M. Garzaran, D. Padua, K. Pingali, P. Stodghill, and P. Wu. A comparison of empirical and model-driven optimization. In Proc. of the International Conference on Programming Language Design and Implementation, pages 63–76, 2003.
