
Lecture Notes in Computer Science 4917


274 A. García et al.

[Figure 1: two bar charts of execution-time percentage (y-axis 0%–100%), one bar per benchmark; panel (a) SPECint, panel (b) SPECfp]

Fig. 1. Percentage of time executing simple dynamic loops

processors do not have any information about whether or not the individual instructions executed belong to a loop. Indeed, when an instruction reaches the execution engine of the processor after being fetched, decoded, and renamed, it retains little or no algorithmic semantic information. Each instruction only remembers its program order, kept in a structure like the reorder buffer (ROB), as well as the basic block it belongs to, in order to support speculation.

Our objective is to introduce the semantic information of high-level loop structures into the processor. A loop-conscious architecture would be able to exploit ILP in a more complexity-effective way, also enabling the possibility of rescheduling instructions and optimizing code dynamically. However, this is not an easy design task and must be developed step by step. Our first approach to designing the Loop Processor Architecture (LPA) is to capture and store already renamed instructions in a buffer that we call the loop window.

In order to simplify the design of our proposal, we consider only simple dynamic loops that execute a single control path, that is, loops whose body does not contain any branch instruction whose direction changes during loop execution. We have found that simple dynamic loops are frequent structures in our benchmark programs. Figure 1 shows the percentage of execution time spent in simple dynamic loops for the SPECint2000 and SPECfp2000 programs. On average, they are responsible for 28% and 60% of the execution time, respectively.
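The single-control-path condition can be stated operationally: across all iterations, every branch inside the loop body resolves in the same direction. A small Python check over a per-iteration branch-outcome trace makes this concrete; it is our own illustration of the definition, not the paper's measurement methodology.

```python
def is_simple_dynamic_loop(iterations):
    """Return True if the loop executes a single control path:
    every branch in the body (keyed by its PC) takes the same
    direction in every iteration. `iterations` is a list of
    {branch_pc: taken} dicts, one per iteration (illustrative)."""
    if not iterations:
        return True
    return all(it == iterations[0] for it in iterations[1:])

# A body branch that is always not-taken keeps the loop simple...
simple = is_simple_dynamic_loop([{0x40: False}, {0x40: False}, {0x40: False}])
# ...while a direction change on any iteration makes it non-simple.
not_simple = is_simple_dynamic_loop([{0x40: False}, {0x40: True}])
```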

The execution of a simple dynamic loop implies the repetitive execution of the same group of instructions (the loop instructions) during each loop iteration. In a conventional processor design, the same loop branch is predicted as taken once per iteration, and any branch inside the loop body is predicted to have the same behavior in all iterations. Furthermore, the loop instructions are fetched, decoded, and renamed again and again until all loop iterations complete.

Such a repetitive process wastes a great deal of energy, since the structures responsible for these tasks account for a large fraction of the overall processor energy consumption. For instance, the first-level instruction cache is responsible for 10%–20% [2], the branch predictor for 10% or more [3], and the rename logic for 15% [4].
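One plausible way to spot these repetitive phases in hardware is to watch the loop-closing branch described above: a backward branch that is taken to the same target several times in a row. The sketch below is a generic detection heuristic of our own, with a made-up repetition threshold; the paper's actual detector may work differently.

```python
def detect_loop(branch_trace, threshold=3):
    """Flag a candidate loop when the same backward taken branch
    (target address below the branch PC) occurs `threshold` times
    consecutively. `branch_trace` is a list of (pc, target, taken)
    tuples; returns the (pc, target) pair of the loop branch, or
    None. Illustrative heuristic only."""
    streak, last = 0, None
    for pc, target, taken in branch_trace:
        if taken and target < pc:   # backward taken branch: a loop closer?
            streak = streak + 1 if (pc, target) == last else 1
            last = (pc, target)
            if streak >= threshold:
                return last
        # Forward or not-taken branches are ignored: they may simply
        # lie inside the loop body without breaking the iteration streak.
    return None

# Loop branch at 0x110 jumping back to 0x100, taken three times:
found = detect_loop([(0x110, 0x100, True)] * 3)
# A purely forward taken branch never triggers detection:
none_found = detect_loop([(0x100, 0x200, True)] * 5)
```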

The main objective of the initial LPA design presented in this paper is to avoid this energy waste. Since the instructions are stored in the loop window, there is no need to use the branch predictor, the instruction cache, or the decode logic. Furthermore, the loop window contains enough information to build the

