

Table 1. Configuration of the simulated processor

fetch, rename, and commit width        6 instructions
int and fp issue width                 6 instructions
load/store issue width                 6 instructions
int, fp, and load/store issue queues   64 entries
reorder buffer                         256 entries
int and floating point registers       160 registers
conditional branch predictor           64K-entry gshare
branch target buffer                   1024 entries, 4-way associative
return address stack (RAS)             32 entries
L1 instruction cache                   64 KB, 2-way associative, 64-byte blocks
L1 data cache                          64 KB, 2-way associative, 64-byte blocks
L2 unified cache                       1 MB, 4-way associative, 128-byte blocks
main memory latency                    100 cycles

Our simulator models a 10-stage processor pipeline. In order to provide results representative of current superscalar processor designs, we have configured the simulator as a 6-instruction-wide processor. Table 1 shows the main values of our simulation setup. The processor pipeline is modeled as described in the previous section. We have evaluated a wide range of loop-window setups and chosen a 128-instruction loop window because it is able to capture most simple dynamic loops. Larger loop windows would only provide marginal benefits that do not compensate for the increased implementation cost.
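For illustration, the setup of Table 1 together with the chosen loop window can be summarized as a simple configuration record. This is only a sketch in Python with field names of our own choosing; it is not the interface of the simulator used in the paper.

# Illustrative only: a hypothetical record of the simulated machine's key
# parameters (taken from Table 1) plus the chosen loop window size.
from dataclasses import dataclass

@dataclass
class SimulatedProcessor:
    pipeline_depth: int = 10            # 10-stage pipeline
    width: int = 6                      # fetch, rename, issue, and commit width
    issue_queue_entries: int = 64       # per queue: int, fp, load/store
    rob_entries: int = 256              # reorder buffer
    registers: int = 160                # int and fp register files (Table 1)
    l1_size_kb: int = 64                # instruction and data caches
    l2_size_mb: int = 1
    memory_latency_cycles: int = 100
    loop_window_entries: int = 128      # captures most simple dynamic loops

config = SimulatedProcessor()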

We feed our simulator with traces of 300 million instructions collected from the SPEC2000 integer and floating point benchmarks using the reference input set. The benchmarks were compiled on a DEC Alpha AXP 21264 [7] processor with Digital UNIX V4.0, using the standard DEC C V5.9-011, Compaq C++ V6.2-024, and Compaq Fortran V5.3-915 compilers at the -O2 optimization level. To find the most representative execution segment, we have analyzed the distribution of basic blocks as described in [8].
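As a rough sketch of this kind of segment selection (a SimPoint-style comparison of basic block distributions; the helper functions and the Manhattan-distance metric below are our own illustrative choices, not details taken from [8]):

# Hypothetical sketch: pick the trace segment whose basic block frequency
# distribution is closest (Manhattan distance) to the whole program's.
from collections import Counter

def normalize(counts):
    total = sum(counts.values())
    return {bb: c / total for bb, c in counts.items()}

def manhattan(p, q):
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def most_representative_segment(segments):
    """segments: list of Counter(basic_block -> executed instruction count)."""
    whole_program = normalize(sum(segments, Counter()))
    distances = [manhattan(normalize(s), whole_program) for s in segments]
    return min(range(len(segments)), key=distances.__getitem__)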

4 LPA Evaluation<br />

In this section, we evaluate our loop window approach. We show that the loop window provides substantial reductions in processor front-end activity. We also show that the achievable performance gains are limited by the size of the main back-end structures. However, if an unbounded back-end is available, the loop window can provide significant performance speedups.

4.1 Front-End Activity<br />

The loop window replaces the functionality of the prediction, fetch, decode, and rename pipeline stages, which allows their activity to be reduced. Figure 7 shows the activity reduction achieved for both the SPECint and SPECfp 2000 benchmarks. On average, the loop window reduces the front-end activity by 14% for SPECint
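To make the accounting behind such a number concrete (a simplified model of our own, not the paper's exact measurement), every instruction that the loop window supplies is one that the prediction, fetch, decode, and rename stages do not have to process, so the front-end activity reduction is essentially the loop window's coverage of the dynamic instruction stream:

# Simplified, hypothetical accounting: instructions served from the loop
# window skip the prediction/fetch/decode/rename stages, so the front-end
# activity reduction equals the loop window's coverage.
def front_end_activity_reduction(total_insns, insns_from_loop_window):
    return insns_from_loop_window / total_insns

# e.g. if 14 out of every 100 dynamic instructions come from the loop window,
# front-end activity drops by about 14%.
print(front_end_activity_reduction(100, 14))  # 0.14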
