Lecture Notes in Computer Science 4917


284 A. García et al.

[Figure 9: bar charts of IPC speedup (y-axis, 0% to 70%) per benchmark, comparing "4-wide front-end with 6-wide LW" against "6-wide front-end without LW". Panel (a) SPECint: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, AVG. Panel (b) SPECfp: 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi, AVG.]

Fig. 9. Performance speedup over a 4-wide front-end processor using an unbounded 6-wide back-end

branches, which could severely limit performance in programs composed of short
loops. The IBM/360 Model 91 introduced loop mode execution as a way of
reducing the effective fetch bandwidth requirements [10,11]. When a loop is
detected in an 8-word prefetch buffer, loop mode activates: instruction fetch
from memory is stalled and all branch instructions are predicted to be taken.
On average, loop mode was active 30% of the execution time.
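The loop mode condition described above can be illustrated with a minimal sketch. The detection rule used here (a backward branch whose own address and target both lie inside the buffered window) is an assumption made for illustration; the text only states that a loop is detected in an 8-word prefetch buffer. The names `PrefetchBuffer`, `base`, and `enters_loop_mode` are hypothetical, and addresses are counted in words.

```python
from dataclasses import dataclass

BUFFER_WORDS = 8  # prefetch buffer size given in the text


@dataclass
class PrefetchBuffer:
    # Word address of the oldest buffered instruction (hypothetical field).
    base: int

    def contains(self, addr: int) -> bool:
        """True if the word at `addr` currently sits in the buffer window."""
        return self.base <= addr < self.base + BUFFER_WORDS


def enters_loop_mode(buf: PrefetchBuffer, branch_addr: int, target_addr: int) -> bool:
    """Assumed rule: the branch jumps backward and the whole loop body,
    from the target up to the branch itself, is already buffered."""
    backward = target_addr <= branch_addr
    return backward and buf.contains(branch_addr) and buf.contains(target_addr)
```

Under this assumed rule, once `enters_loop_mode` holds, the instructions between the target and the branch can be replayed from the buffer while fetch from memory stays stalled.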

Nowadays, hardware-based loop caching is mainly used in embedded
systems to reduce the energy consumption of the processor fetch engine. Lee
et al. [12] describe a buffering scheme for simple dynamic loops with a single
execution path. It detects backward branches (loop branches) and
captures the loop instructions in a direct-mapped array (the loop buffer), which
reduces the instruction fetch energy consumption. In addition, a loop buffer
dynamic controller (LDC) avoids penalties due to loop cache misses. Although
this LDC only captures simple dynamic loops, it was recently improved [13] to
detect and capture nested loops, loops with complex internal control flow, and
portions of loops that are too large to fit completely in a loop cache. This loop
controller is a finite state machine that provides more sophisticated utilization of the
loop cache.
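The basic idea of capturing a simple loop on a backward branch can be sketched as a small state machine. This is an illustrative model only, not the controller of [12] or [13]: the three states, the 16-entry buffer size, the `pc % size` indexing, and all names (`LoopBuffer`, `fetch`, `on_branch`, `run_loop`) are assumptions, and only loops with a single execution path are handled.

```python
from enum import Enum


class State(Enum):
    IDLE = 0    # no loop tracked, fetch from memory
    FILL = 1    # loop detected, copying its body into the buffer
    ACTIVE = 2  # loop body fully captured, fetch from the buffer


class LoopBuffer:
    def __init__(self, size=16):
        self.size = size
        self.slots = [None] * size  # direct-mapped: slot index = pc % size
        self.memory_fetches = 0
        self.buffer_fetches = 0
        self._reset()

    def _reset(self):
        self.state = State.IDLE
        self.start = self.end = None

    def fetch(self, pc, memory):
        in_loop = self.state is not State.IDLE and self.start <= pc <= self.end
        if self.state is State.ACTIVE and in_loop:
            self.buffer_fetches += 1
            return self.slots[pc % self.size]
        insn = memory[pc]
        self.memory_fetches += 1
        if self.state is State.FILL and in_loop:
            self.slots[pc % self.size] = insn
        return insn

    def on_branch(self, pc, target, taken):
        # A loop branch is a taken backward branch whose body fits in the buffer.
        loop_branch = taken and target <= pc and pc - target + 1 <= self.size
        if loop_branch and (target, pc) == (self.start, self.end):
            if self.state is State.FILL:
                self.state = State.ACTIVE  # body was copied during the last pass
        elif loop_branch:
            self.start, self.end = target, pc
            self.state = State.FILL
        elif taken or pc == self.end:
            self._reset()  # left the single execution path or exited the loop


def run_loop(buf, memory, start, end, iterations):
    """Drive the buffer through a loop whose only branch is at `end`."""
    fetched = []
    for i in range(iterations):
        for pc in range(start, end + 1):
            fetched.append(buf.fetch(pc, memory))
        buf.on_branch(end, start, taken=(i < iterations - 1))
    return fetched
```

In this model the first iteration detects the loop, the second fills the buffer, and every later iteration is served from the buffer, which is where the fetch energy saving would come from. Because the loop body is no longer than the buffer, the `pc % size` mapping never places two loop instructions in the same slot.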

Unlike the techniques mentioned above, our mechanism is not focused only
on reducing the energy consumption of the fetch engine. The main contribution
of LPA is our novel rename mapping building algorithm, which makes it possible
for our proposal to also reduce the energy consumption of the rename logic,
one of the hot spots in processor designs. However, there is still room for improvement.
The implementation presented in this paper only captures loops with
a single execution path, whereas in general-purpose applications 50% of the
loops have variable-dependent trip counts and/or contain conditional branches
in their bodies. Therefore, future research effort should be devoted to enhancing
our renaming model to capture more complex structures in the loop window.

Although our mechanism is based on capturing simple loops that are mostly
predictable by traditional branch predictors, improving loop branch prediction
would benefit benchmarks whose loop branches are less
predictable. In addition, advanced loop prediction would be very useful to enable
LPA to capture more complex loop patterns. Many mechanisms have been
