21.01.2013 Views

Lecture Notes in Computer Science 4917


LPA: A First Approach to the Loop Processor Architecture

Fig. 2. LPA Architecture. [Diagram labels: predict, fetch, decode, rename, dispatch; loop window; out-of-order execution core]

rename mapping of each loop iteration, and thus there is no need to access the rename mapping table and the dependence detection and resolution circuitry. According to our results, the loop window is able to greatly reduce the processor energy consumption. On average, the activity of the processor front-end is reduced by 14% for SPECint benchmarks and by 45% for SPECfp benchmarks. In addition, the loop window is able to fetch instructions at a faster rate than the normal front-end pipeline because it is not limited by taken branches or instruction alignment in memory. However, our results show that the achievable performance gain is limited by the size of the main back-end structures in current processor designs. Consequently, we evaluate the potential of our loop window approach in a large instruction window processor [5] with virtually unbounded structures, showing that a performance speedup of up to 40% is achievable.

2 The LPA Architecture

The objective of our first approach to LPA is to replace the functionality of the prediction, fetch, decode, and rename stages during the execution of simple loop structures, as shown in Figure 2. To do this, the renamed instructions that belong to a simple loop structure are stored in a buffer that we call the loop window. Once all the required loop information is stored in the loop window, it is able to feed the dispatch logic with already decoded and renamed instructions, making all previous pipeline stages unnecessary.
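As an illustration only, the role of the loop window as a buffer that feeds dispatch can be sketched in software. The names `RenamedInstr`, `LoopWindow`, `store`, and `feed_dispatch`, the entry format, and the fixed `capacity` are all assumptions made for this sketch and do not come from the LPA design. The sketch also replays identical entries on every iteration, so it deliberately ignores how the rename mapping of each loop iteration is rebuilt.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class RenamedInstr:
    """Hypothetical entry format: an already decoded and renamed instruction."""
    opcode: str
    dest: str
    srcs: Tuple[str, ...]


@dataclass
class LoopWindow:
    """Toy model of a buffer holding the renamed instructions of one loop body."""
    capacity: int  # assumed fixed size; the real sizing is not given here
    entries: List[RenamedInstr] = field(default_factory=list)

    def store(self, instr: RenamedInstr) -> bool:
        """Record one instruction; return False if the loop does not fit."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(instr)
        return True

    def feed_dispatch(self, iterations: int) -> Iterator[RenamedInstr]:
        """Replay the stored body, bypassing predict, fetch, decode, and rename.

        Simplification: per-iteration register renaming is not modelled.
        """
        for _ in range(iterations):
            yield from self.entries
```

In this model, a loop body that exceeds `capacity` is simply rejected, which is one plausible policy for a small buffer rather than a documented one.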

The loop window has very simple control logic, so the implementation of this scheme has little impact on the processor hardware cost. When a backward branch is predicted taken, LPA starts loop detection. All the decoded and renamed instructions after this point are then stored in the loop window during the second iteration of the loop. If the same backward branch is found and it is taken again, then LPA has effectively stored the loop. The detection of data dependences is done during the third iteration of the loop. When the backward branch is taken for the third time, LPA contains all the information it needs about the loop structure.
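The three-iteration detection sequence described above can be read as a small state machine. The following is a minimal sketch under stated assumptions: the class `LoopDetector`, the state names, and the `step` interface are hypothetical, instructions are represented only by their program counter, and the policy of abandoning a candidate loop on any mismatching or not-taken backward branch is an assumption, since the text does not specify how detection is aborted.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()         # no candidate loop
    STORING = auto()      # second iteration: instructions are being stored
    DEPENDENCES = auto()  # third iteration: data dependences are detected
    READY = auto()        # the loop can be served from the loop window


class LoopDetector:
    def __init__(self):
        self.reset()

    def reset(self):
        self.state = State.IDLE
        self.loop_branch_pc = None
        self.body = []

    def step(self, pc, backward_branch=False, taken=False):
        """Observe one instruction and return the state afterwards."""
        if self.state is State.STORING:
            self.body.append(pc)  # instruction seen during the second iteration

        if not backward_branch:
            return self.state

        if self.state is State.IDLE:
            if taken:  # a taken backward branch starts loop detection
                self.loop_branch_pc = pc
                self.state = State.STORING
        elif not taken or pc != self.loop_branch_pc:
            self.reset()  # assumption: any mismatch abandons the candidate
        elif self.state is State.STORING:
            self.state = State.DEPENDENCES  # same branch taken again: loop stored
        elif self.state is State.DEPENDENCES:
            self.state = State.READY  # branch taken for the third time
        return self.state
```

Feeding the detector three iterations of a loop whose closing branch is taken each time moves it from `IDLE` through `STORING` and `DEPENDENCES` to `READY`, with `body` holding the instructions observed during the second iteration.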

From this point onwards, LPA is able to fetch the instructions belonging to the loop from the loop window, and thus the branch predictor and the instruction cache are not used. Since these instructions are already decoded, there is no need to use the decoding logic. Moreover, the loop window stores enough information to build the register rename map, and thus there is no need to access the rename mapping table and the dependence detection and resolution circuitry. Therefore,
