Lecture Notes in Computer Science 4917


394 Y. Ben Asher et al.

1. Ideally, the best reordering trace starts with the hot basic block BB1 in foo, followed by the hot basic blocks BB5 and BB7 in bar itself, and finally the hot basic blocks BB2 and BB4 that follow the call instruction to bar in foo. This trace reordering is given in Figure 7b. Unfortunately, although this is the ideal reordering trace, we are forced to add an extra jump instruction to BB2 immediately after the call instruction to bar. The extra jump instruction is necessary to maintain the original program correctness, so that the return instruction in bar will continue to BB2.

2. In order to avoid the extra jump instruction, it is possible to form two reordering traces. A trace consisting of the hot basic blocks in foo: BB1, BB2, and BB4, is followed by a second trace consisting of the hot basic blocks in bar: BB5 and BB7. The resulting code for this selection of reordering traces is shown in Figure 7c. Although this selection does not generate extra jumps for maintaining correctness, it does not reflect the true control flow of the program, as it avoids creating traces that can cross function boundaries.

Figure 7 shows that after inlining bar at the call site in foo, the code reordering creates the optimal hot path, which follows the true control flow without the extra jump or return instructions. Furthermore, function inlining increases the average size of each reordering trace. In Figure 7, the reordering trace after function inlining includes six instructions, which is longer than each of the reordering traces BB1, BB2, BB4 or BB5, BB7 shown in Figure 7c.

The longer the traces produced by code reordering, the better the program locality. We assert that comparing the average size of traces created before aggressive inlining with the average size after inlining can serve as a measure of the reverse effect, in which inlining helps code reordering. In general, the average size of traces increases due to function inlining: traces that started to grow in a certain function can now grow into the corresponding inlined callee functions.
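The before/after comparison described above can be sketched as a small calculation. This is only an illustration under stated assumptions: the trace contents are taken from the foo/bar example, the helper names `average_trace_size` and `trace_growth` are hypothetical, and trace size is approximated by the number of basic blocks, whereas the text measures it in instructions.

```python
from statistics import mean

# Traces as lists of basic-block names. Block count stands in for trace
# size (an assumption: the text counts instructions, not blocks).
traces_before = [["BB1", "BB2", "BB4"], ["BB5", "BB7"]]  # two traces, one per function
traces_after = [["BB1", "BB5", "BB7", "BB2", "BB4"]]     # one trace after inlining bar into foo


def average_trace_size(traces):
    """Mean number of basic blocks per reordering trace."""
    return mean(len(trace) for trace in traces)


def trace_growth(before, after):
    """Ratio of the average trace size after inlining to the size before."""
    return average_trace_size(after) / average_trace_size(before)
```

For the example traces, the average size is 2.5 blocks before inlining and 5 blocks after, so `trace_growth(traces_before, traces_after)` evaluates to 2.0.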

4.1 Synergy Experimental Results<br />

The following experimental results demonstrate this “reverse effect” and the synergy between function inlining and code reordering on the Power4. The Power4 has an extensive set of Performance Counters (PMCs) that enable us to isolate the reasons for a program’s behavior. These count the L1 Icache fetches and branch target mispredictions. Figure 8 shows the percentage reduction in L1 Icache fetches, comparing code reordering alone with the combination of inlining and code reordering (denoted as “aggressive inlining”). On average, adding inlining raises the improvement from 16% to 24%. More significant results are presented in Figure 9, which shows the percentage of reduced branch target mispredictions. The code reordering scheme adds extra branches, which cause extra target mispredictions. Applying the aggressive inlining scheme removes many of these branches and reduces the number of target mispredictions for most applications. The direction mispredictions are reduced as well, albeit at a lower rate (3%) than the target mispredictions.
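As a rough illustration of how such percentages could be derived from raw counter readings, the sketch below computes a relative reduction against a baseline run. The function name `percent_reduction` and the counter values are hypothetical; the values are not measurements from the experiments and were chosen only so that the results match the 16% and 24% averages quoted above.

```python
def percent_reduction(baseline, optimized):
    """Percentage by which `optimized` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline count must be positive")
    return 100.0 * (baseline - optimized) / baseline


# Hypothetical L1 Icache fetch counts (illustrative only, not measured data).
icache_fetches = {
    "original": 1000,
    "reordered": 840,
    "inlined_reordered": 760,
}

reordering_gain = percent_reduction(
    icache_fetches["original"], icache_fetches["reordered"]
)
aggressive_inlining_gain = percent_reduction(
    icache_fetches["original"], icache_fetches["inlined_reordered"]
)
```

With these made-up counts, `reordering_gain` is 16.0 and `aggressive_inlining_gain` is 24.0.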

We have also tested the synergy between code reordering and inlining on the PowerPC 405 processor used for embedded systems. The PowerPC 405 is the core
