21.01.2013 Views

Lecture Notes in Computer Science 4917

Lecture Notes in Computer Science 4917

Lecture Notes in Computer Science 4917

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

126 S. Eyerman, L. Eeckhout, and J.E. Smith<br />

avg normalized execution time<br />

1.000<br />

0.975<br />

0.950<br />

0.925<br />

0.900<br />

0.875<br />

0.850<br />

0.825<br />

0.800<br />

base<br />

tree opt<br />

const prop/elim<br />

basic loop opt<br />

if-conversion<br />

O1<br />

O2 -fnoO2<br />

CSE<br />

BB reorder<br />

strength red<br />

recursion opt<br />

<strong>in</strong>sn schedul<strong>in</strong>g<br />

strict alias<strong>in</strong>g<br />

alignment<br />

adv tree opt<br />

O2<br />

aggr loop opt<br />

<strong>in</strong>l<strong>in</strong><strong>in</strong>g<br />

O3<br />

loop unroll<strong>in</strong>g<br />

software pipel<strong>in</strong><strong>in</strong>g<br />

FDO<br />

Fig. 6. Average normalized cycle counts on a superscalar <strong>in</strong>-order processor<br />

available <strong>in</strong> the pipel<strong>in</strong>e to issue because of a branch misprediction, the cycle is assigned<br />

to the branch misprediction cycle component.<br />

The result of compar<strong>in</strong>g the <strong>in</strong>-order processor cycle components aga<strong>in</strong>st the out-oforder<br />

processor cycle components is presented <strong>in</strong> Figure 4. To facilitate the discussion,<br />

we make the follow<strong>in</strong>g dist<strong>in</strong>ction <strong>in</strong> cycle components. The first group of cycle components<br />

is affected by the dynamic <strong>in</strong>struction count and the critical path of <strong>in</strong>ter-operation<br />

dependencies; these are the base, resource stall, and branch misprediction penalty cycle<br />

components. We observe from Figure 4 that these cycle components are affected more<br />

by the compiler optimizations for the <strong>in</strong>-order processor than for the out-of-order processor:<br />

14.6% versus 8.0%. The second group of cycle components are related to the<br />

L1 and L2 cache and TLB miss events and the number of branch misprediction events.<br />

This second group of miss events is affected more for the out-of-order processor: this<br />

is only 2.3% for the <strong>in</strong>-order processor versus 7% for the out-of-order processor. In<br />

other words, most of the performance ga<strong>in</strong> through compiler optimizations on an <strong>in</strong>order<br />

processor comes from reduc<strong>in</strong>g the dynamic <strong>in</strong>struction count and shorten<strong>in</strong>g the<br />

critical path of <strong>in</strong>ter-operation dependencies. On an out-of-order processor, the dynamic<br />

<strong>in</strong>struction count and the critical path are also important factors affect<strong>in</strong>g overall performance,<br />

however, about one half of the total performance speedup comes from secondary<br />

effects related to I-cache, long-latency D-cache and branch misprediction behavior.<br />

There are three reasons that support these observations. First, out-of-order execution<br />

hides part of the <strong>in</strong>ter-operation dependencies and latencies which reduces the impact<br />

of critical path optimizations. In particular, <strong>in</strong> a balanced out-of-order processor, the<br />

critical path of <strong>in</strong>ter-operation dependencies is only visible on a branch misprediction.<br />

Second, the base and resource stall cycle components are more significant for an <strong>in</strong>order<br />

processor than for an out-of-order processor; this makes the miss event cycle<br />

components relatively less significant for an <strong>in</strong>-order processor than for an out-of-order<br />

processor. As such, an improvement to these miss event cycle components results <strong>in</strong><br />

a smaller impact on overall performance for <strong>in</strong>-order processors. Third, schedul<strong>in</strong>g <strong>in</strong>structions<br />

can have a bigger impact on memory-level parallelism on an out-of-order<br />

processor than on an <strong>in</strong>-order processor. A good static <strong>in</strong>struction schedule will place<br />

<strong>in</strong>dependent long-latency D-cache and D-TLB misses closer to each other <strong>in</strong> the dynamic<br />

<strong>in</strong>struction stream. An out-of-order processor will be able to exploit the available<br />

MLP at run time <strong>in</strong> case the <strong>in</strong>dependent long-latency loads appear with<strong>in</strong> a ROB size<br />

from each other <strong>in</strong> the dynamic <strong>in</strong>struction stream. An <strong>in</strong>-order processor on the other

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!