Throughout the following analysis we also refer to Tables 3 and 4, which show the number of benchmarks for which a given compiler optimization has a positive or negative effect, respectively, on the various cycle components. These tables also show the number of benchmarks for which the dynamic instruction count is significantly affected by the various compiler optimizations; likewise for the number of long back-end misses and their amount of MLP, as well as for the number of branch mispredictions and their penalty. We do not show average performance improvement numbers in these tables because outliers make the interpretation difficult; instead, we treat outliers in the following discussion.

Basic Loop Optimizations. Basic loop optimizations move constant expressions out of the loop and simplify loop exit conditions. Most benchmarks benefit from these loop optimizations. One reason for the improved performance is a smaller dynamic instruction count, which reduces the base cycle component; a second reason is that the simplified loop exit conditions result in a reduced branch misprediction penalty. Two benchmarks that benefit significantly from loop optimizations are perlbmk (6.7% improvement) and art (5.9% improvement). The reason for these improvements differs between the two benchmarks. For perlbmk, the improvement comes from a reduced L1 I-cache component and a reduced branch misprediction component. The reduced L1 I-cache component is due to fewer L1 I-cache misses. The branch misprediction cycle component is reduced mainly because of a reduced branch misprediction penalty; the number of branch mispredictions is not affected very much. In other words, the loop optimizations shorten the critical path leading to the mispredicted branch so that the branch gets resolved earlier. For art, on the other hand, the major cycle reduction is observed in the L2 D-cache cycle component. The reason is an increased number of overlapping L2 D-cache misses: the number of L2 D-cache misses remains the same, but the reduced code footprint brings the L2 D-cache misses closer to each other in the dynamic instruction stream, which results in more memory-level parallelism.
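To make these transformations concrete, the C sketch below is our own illustration (not code from the paper or from the benchmarks discussed above) of loop-invariant code motion and exit-condition simplification; the function and variable names are invented for the example.

    #include <stddef.h>

    /* Illustrative sketch only. Before: the product scale * bias is
     * recomputed in every iteration, and the exit test does redundant
     * arithmetic on every trip around the loop. */
    void smooth_naive(float *dst, const float *src, size_t n,
                      float scale, float bias)
    {
        for (size_t i = 0; i * 4 < n * 4; i++)   /* needlessly complex exit condition */
            dst[i] = src[i] * (scale * bias);    /* loop-invariant subexpression */
    }

    /* After basic loop optimizations: the invariant product is hoisted out
     * of the loop and the exit condition is reduced to a simple compare,
     * which shrinks the dynamic instruction count and shortens the
     * dependence chain feeding the loop branch. */
    void smooth_optimized(float *dst, const float *src, size_t n,
                          float scale, float bias)
    {
        const float factor = scale * bias;       /* hoisted invariant */
        for (size_t i = 0; i < n; i++)           /* simplified exit condition */
            dst[i] = src[i] * factor;
    }
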

If-conversion. The goal of if-conversion is to eliminate hard-to-predict branches through predicated execution. The potential drawback of if-conversion is that more instructions need to be executed: instructions along multiple control flow paths are executed, and some of them are useless. Executing more instructions shows up as a larger base cycle component. In addition, more instructions need to be fetched; we observe that this also increases the number of L1 I-cache misses for several benchmarks. Approximately half the benchmarks benefit from if-conversion; for these benchmarks, the reduction in the number of branch mispredictions outweighs the increased number of instructions that need to be executed. For the other half of the benchmarks, the main reason for the decreased performance is the increased number of dynamically executed instructions.
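As a rough illustration (again our own sketch, not taken from the paper), the C fragment below contrasts a branchy loop with an if-converted version that computes both candidate values and selects one, the form compilers typically lower to conditional-move or predicated instructions; all names are hypothetical.

    #include <stddef.h>

    /* Branchy version: the outcome of the compare depends on the data and
     * is hard to predict, so each misprediction costs a pipeline flush. */
    long clamp_sum_branchy(const long *v, size_t n, long limit)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (v[i] > limit)          /* hard-to-predict, data-dependent branch */
                sum += limit;
            else
                sum += v[i];
        }
        return sum;
    }

    /* If-converted version: both values are available and one is selected,
     * which typically lowers to a conditional move or predication. The
     * misprediction penalty disappears, but more instructions are executed
     * per iteration, enlarging the base cycle component. */
    long clamp_sum_ifconverted(const long *v, size_t n, long limit)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            long x = v[i];
            sum += (x > limit) ? limit : x;   /* select instead of branch */
        }
        return sum;
    }
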

An interesting benchmark to consider more closely is vpr: its base, resource stall, and L1 D-cache cycle components increase by 4.5%, 9.6%, and 3.9%, respectively. This analysis shows that if-conversion adds to the already very long critical path in vpr: vpr executes a tight loop with loop-carried dependencies, which results in very long dependence chains. If-conversion adds to the critical path because registers may need to be copied using conditional move instructions at the reconvergence point. Because of this very long critical path in vpr, issue is unable to keep up with dispatch, which causes
