Lecture Notes in Computer Science 4917
394 Y. Ben Asher et al.
1. Ideally, the best reordering trace starts with the hot basic block BB1 in foo,
followed by the hot basic blocks BB5 and BB7 in bar, and finally the
hot basic blocks BB2 and BB4 that follow the call instruction to bar in foo.
This trace is shown in Figure 7b. Unfortunately, although this is the ideal
reordering trace, it forces us to add an extra jump instruction to BB2
immediately after the call instruction to bar. The extra jump is necessary to
preserve program correctness: when bar returns, execution resumes at the
instruction after the call, and the jump redirects it to BB2.
2. To avoid the extra jump instruction, it is possible to form two
reordering traces: a trace consisting of the hot basic blocks in foo (BB1, BB2,
and BB4), followed by a second trace consisting of the hot basic blocks in
bar (BB5 and BB7). The resulting code for this selection of reordering traces
is shown in Figure 7c. Although this selection requires no extra jumps
to maintain correctness, it does not reflect the true control flow of the
program, because no trace crosses a function boundary.
Figure 7 shows that after inlining bar at the call site in foo, code reordering
creates the optimal hot path: it follows the true control flow without the
extra jump or return instructions. Furthermore, function inlining increases
the average size of the reordering traces. In Figure 7, the reordering trace
after function inlining contains six instructions, which is longer than either
of the reordering traces BB1, BB2, BB4 or BB5, BB7 shown in Figure 7c.
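The trade-off among the three layouts can be made concrete with a small model. The following is a minimal sketch, not the authors' tool: it walks the hot path described above and, for a given block layout, counts (a) extra jumps, under the assumption taken from item 1 that a call whose callee is placed directly after it needs one added jump to its continuation block, and (b) hot transfers whose target is not the next block in memory. The names `layout_cost`, `HOT_PATH`, and `INLINED_PATH` are hypothetical, cold blocks are omitted, and the inlined blocks are assumed to keep their original labels.

```python
from typing import Dict, List, Tuple

Transfer = Tuple[str, str, str]  # (source block, destination block, kind)

# Hot path before inlining, as described in the text:
# BB1 in foo calls bar (BB5, BB7); bar returns to BB2, which continues to BB4.
HOT_PATH: List[Transfer] = [
    ("BB1", "BB5", "call"),
    ("BB5", "BB7", "flow"),
    ("BB7", "BB2", "return"),
    ("BB2", "BB4", "flow"),
]

# After inlining bar into foo, the call and return become ordinary control flow.
# Assumption: the inlined blocks keep the labels BB5 and BB7.
INLINED_PATH: List[Transfer] = [(src, dst, "flow") for src, dst, _ in HOT_PATH]


def layout_cost(layout: List[str], path: List[Transfer]) -> Tuple[int, int]:
    """Return (extra_jumps, non_adjacent_transfers) for a block layout."""
    position: Dict[str, int] = {block: i for i, block in enumerate(layout)}
    extra_jumps = 0
    non_adjacent = 0
    for src, dst, kind in path:
        if position[dst] != position[src] + 1:
            # The hot successor is not the next block in memory.
            non_adjacent += 1
        elif kind == "call":
            # The callee sits right after the call, exactly where the return
            # address points, so a jump to the continuation block is inserted.
            extra_jumps += 1
    return extra_jumps, non_adjacent


cross_function = ["BB1", "BB5", "BB7", "BB2", "BB4"]  # ordering of Figure 7b
per_function = ["BB1", "BB2", "BB4", "BB5", "BB7"]    # ordering of Figure 7c

print(layout_cost(cross_function, HOT_PATH))      # cross-function trace
print(layout_cost(per_function, HOT_PATH))        # one trace per function
print(layout_cost(cross_function, INLINED_PATH))  # after inlining
```

Under these assumptions, the cross-function trace costs one extra jump, the per-function traces leave two hot transfers (the call and the return) non-adjacent, and the inlined layout incurs neither cost.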
The longer the traces produced by code reordering, the better the program
locality. We assert that comparing the average size of traces created before
aggressive inlining with the average size after inlining can serve as a measure
of the reverse effect, in which inlining helps code reordering. In general, the
average size of traces increases due to function inlining: traces that started
to grow in a certain function can now grow into the corresponding inlined
callee functions.
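As an illustration of this measure, the following sketch computes the average trace size before and after inlining for the example of Figure 7. It is a toy under stated assumptions, not the authors' measurement code: the function `average_trace_size` is hypothetical, and the per-block instruction counts are invented, chosen only so that the inlined trace totals six instructions as stated above.

```python
from typing import Dict, List


def average_trace_size(traces: List[List[str]], block_size: Dict[str, int]) -> float:
    """Average number of instructions per trace."""
    if not traces:
        raise ValueError("at least one trace is required")
    sizes = [sum(block_size[block] for block in trace) for trace in traces]
    return sum(sizes) / len(sizes)


# Hypothetical instruction counts per hot block (not taken from Figure 7).
# Before inlining, BB1 ends with the call and BB7 ends with the return.
size_before = {"BB1": 2, "BB2": 1, "BB4": 1, "BB5": 2, "BB7": 2}
# Inlining removes the call from BB1 and the return from BB7.
size_after = {"BB1": 1, "BB2": 1, "BB4": 1, "BB5": 2, "BB7": 1}

traces_before = [["BB1", "BB2", "BB4"], ["BB5", "BB7"]]  # one trace per function
traces_after = [["BB1", "BB5", "BB7", "BB2", "BB4"]]     # single trace after inlining

avg_before = average_trace_size(traces_before, size_before)
avg_after = average_trace_size(traces_after, size_after)
print(avg_before, avg_after, avg_after / avg_before)
```

The ratio of the two averages is the quantity of interest; a value above 1 indicates that inlining let traces grow across the former function boundary.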
4.1 Synergy Experimental Results<br />
The following experimental results demonstrate this “reverse effect” and the
synergy between function inlining and code reordering on the Power4. The Power4
has an extensive set of performance counters (PMCs), which enable us to isolate
the reasons for a program’s behavior. These counters record the L1 Icache
fetches and branch target mispredictions. Figure 8 shows the percentage
reduction in L1 Icache fetches, comparing code reordering alone with the
combination of inlining and code reordering (denoted as “aggressive inlining”).
On average, the reduction rises from 16% with code reordering alone to 24% with
aggressive inlining. More significant results are presented in Figure 9, which
shows the percentage reduction in branch target mispredictions. The code
reordering scheme adds extra branches, which cause extra target mispredictions.
Applying the aggressive inlining scheme removes many of these branches and
reduces the number of target mispredictions for most applications. Direction
mispredictions are reduced as well, albeit at a lower rate (3%) than the
target mispredictions.
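The percentages reported in Figures 8 and 9 are relative reductions in a counter value. A minimal sketch of that arithmetic follows, assuming the baseline is the counter reading for the unoptimized program (the text does not spell out the baseline). The helper `percent_reduction` is hypothetical, and the raw counts are made up so that they reproduce the 16% and 24% averages quoted above; they are not measured data.

```python
def percent_reduction(baseline: float, optimized: float) -> float:
    """Relative reduction of a counter value, in percent of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline count must be positive")
    return 100.0 * (baseline - optimized) / baseline


# Made-up L1 Icache fetch counts, for illustration only.
fetches_baseline = 1_000_000
fetches_reordered = 840_000   # code reordering alone
fetches_aggressive = 760_000  # inlining combined with code reordering

print(percent_reduction(fetches_baseline, fetches_reordered))
print(percent_reduction(fetches_baseline, fetches_aggressive))
```

The same helper applies unchanged to branch target or direction misprediction counts.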
We have also tested the synergy between code reordering and inlining on the
PowerPC 405 processor used for embedded systems. The PowerPC 405 is the core