Lecture Notes in Computer Science 4917


Studying Compiler Optimizations on Superscalar Processors 121

[Figure: bar chart of average normalized execution time (y-axis from 0.800 to 1.000) across the compiler settings base, tree opt, const prop/elim, basic loop opt, if-conversion, O1, O2 -fnoO2, CSE, BB reorder, strength red, recursion opt, insn scheduling, strict aliasing, alignment, adv tree opt, O2, aggr loop opt, inlining, O3, loop unrolling, software pipelining, and FDO.]

Fig. 3. Averaged normalized cycle counts on a superscalar out-of-order processor

[Figure: bar chart of the decrease in each cycle component (0% to 10%) for an out-of-order processor and an in-order processor; components: base, L1 I-cache, L2 I-cache, I-TLB, L1 D-cache, L2 D-cache, D-TLB misses, MLP, branch misses, branch penalty, other.]

Fig. 4. Overall performance improvement on an out-of-order processor and an in-order processor across the various compiler settings partitioned by cycle component

reorder buffer is too small to sustain a given issue rate of instructions; in practice, though, this is an infrequent case. Figure 4 shows the improvement in the branch resolution time across the optimization settings; this is a 1.2% absolute improvement, or a 7.8% relative improvement.
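The relationship between these two figures can be checked directly. A minimal Python sketch follows; note that the implied pre-optimization size of the branch resolution component is only an inference from the ratio, not a number stated in the text:

```python
# Sanity check relating the absolute and relative improvements of the
# branch resolution cycle component reported above.
absolute_improvement = 0.012   # 1.2% of total execution cycles
relative_improvement = 0.078   # 7.8% of the component itself

# Implied size of the branch resolution component before optimization,
# as a fraction of total cycles (an inference, not a figure from the text):
base_component = absolute_improvement / relative_improvement
print(round(base_component, 3))  # roughly 0.154, i.e. about 15% of cycles
```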

Finally, compiler optimizations significantly affect the number of miss events and their overlap behavior. According to Figure 4, 9.6% of the total performance improvement comes from a reduced number of branch mispredictions, and 16.7% and 19.5% of the total performance improvement come from improved L1 I-cache and L2 D-cache cycle components, respectively. The key observation here is that the reduced L2 D-cache cycle component is almost entirely due to improved memory-level parallelism (MLP). In other words, compiler optimizations that bring L2 cache miss loads closer to each other in the dynamic instruction stream improve performance substantially by increasing the amount of MLP.
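This clustering effect can be illustrated with a back-of-the-envelope model: independent miss loads that fall within the same reorder-buffer window overlap their latencies, while misses far apart serialize. The 200-cycle miss latency and 64-entry window below are illustrative assumptions, not values from the study:

```python
def miss_stall_cycles(miss_positions, latency=200, rob_window=64):
    """Estimate total stall cycles for a sequence of L2-miss loads.

    Misses whose dynamic-instruction positions fall within the same
    reorder-buffer window overlap, so their latency is paid once
    (memory-level parallelism); misses in different windows serialize.
    Latency and window size are illustrative assumptions.
    """
    stalls = 0
    window_end = None
    for pos in sorted(miss_positions):
        if window_end is None or pos >= window_end:
            # New non-overlapping miss: pay the full latency.
            stalls += latency
            window_end = pos + rob_window
    return stalls

# Four misses spread far apart in the dynamic stream serialize:
spread = miss_stall_cycles([0, 1000, 2000, 3000])     # 4 * 200 = 800 cycles
# The same four misses clustered inside one window overlap:
clustered = miss_stall_cycles([0, 10, 20, 30])        # 200 cycles total
```

Under this model, an optimization that moves the misses from the first schedule to the second removes three full miss latencies without eliminating a single miss, which is exactly the MLP effect described above.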

4.2 Compiler Optimization Analysis Case Studies<br />

We now present some case studies illustrating the power of interval analysis for gaining insight into how compiler optimizations affect out-of-order processor performance. Figure 5 shows normalized cycle distributions for individual benchmarks; we selected the benchmarks that are affected most by the compiler optimizations. These bars are
