
Lecture Notes in Computer Science 4917


LPA: A First Approach to the Loop Processor Architecture 283

[Figure 8: two bar-chart panels plotting IPC speedup (y-axis, 0% to 30%) per benchmark for two configurations, "4-wide front-end with 6-wide LW" and "6-wide front-end without LW". Panel (a) SPECint: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, AVG. Panel (b) SPECfp: 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi, AVG.]

Fig. 8. Performance speedup over a 4-wide front-end processor (back-end is always 6-wide)

achieve 4.5% speedup. The improvement is larger for SPECfp benchmarks because branch instructions are better predicted, and thus it is possible to extract more ILP.
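The speedup figures in this section are relative gains in IPC over the 4-wide front-end baseline. The excerpt does not give the formula, so the sketch below assumes the usual definition (IPC of the evaluated configuration divided by baseline IPC, minus one) and an arithmetic mean for the AVG bar. The function names and the IPC values are hypothetical, not measurements from the paper.

```python
def ipc_speedup(ipc_config: float, ipc_baseline: float) -> float:
    """Relative IPC gain of a configuration over the baseline, as a fraction.

    Assumed definition: ipc_config / ipc_baseline - 1.
    """
    if ipc_baseline <= 0:
        raise ValueError("baseline IPC must be positive")
    return ipc_config / ipc_baseline - 1.0


def mean_speedup(pairs):
    """Arithmetic mean of per-benchmark speedups (assumed aggregation)."""
    gains = [ipc_speedup(config, baseline) for config, baseline in pairs]
    return sum(gains) / len(gains)


# Hypothetical (config IPC, baseline IPC) pairs for two benchmarks.
example = [(2.1, 2.0), (1.1, 1.0)]
for config, baseline in example:
    print(f"{ipc_speedup(config, baseline):.1%}")
print(f"mean: {mean_speedup(example):.1%}")
```

Other aggregations, such as a geometric or harmonic mean of IPC, would give slightly different averages; the arithmetic mean is used here only for simplicity.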

The left bar shows the speedup achieved when the front-end is still limited to 4 instructions, but a loop window able to fetch up to 6 instructions per cycle is included. This loop window reduces front-end activity. In addition, a 4-instruction-wide front-end is less complex than a 6-wide one. However, these results make clear that adding a loop window is not enough for the 4-wide front-end to match the performance of the 6-wide front-end. It achieves comparable performance only in a few benchmarks, such as 176.gcc, 200.sixtrack, and 301.apsi. On average, SPECint benchmarks achieve a 1% IPC speedup and SPECfp benchmarks achieve a 2.3% speedup.
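To make the three fetch configurations concrete, the following toy model counts the cycles needed to fetch an instruction stream at peak bandwidth. It is only a sketch under strong assumptions: the stream is pre-split into loop and non-loop segments, loop segments are always served by the loop window when one is present, and every cycle delivers the full width. It ignores branch mispredictions, cache misses, and back-end stalls, so it says nothing about achieved IPC. The function name and the instruction counts are hypothetical.

```python
from math import ceil


def fetch_cycles(segments, front_end_width, loop_window_width=None):
    """Cycles needed to fetch all segments at peak bandwidth.

    segments: iterable of (instruction_count, in_simple_loop) pairs.
    Loop segments use the loop window width when a loop window is present;
    everything else goes through the regular front-end.
    """
    cycles = 0
    for count, in_loop in segments:
        if in_loop and loop_window_width is not None:
            width = loop_window_width
        else:
            width = front_end_width
        cycles += ceil(count / width)
    return cycles


# Hypothetical stream: 120 instructions inside simple loops, 60 outside.
stream = [(120, True), (60, False)]

baseline = fetch_cycles(stream, front_end_width=4)
with_lw = fetch_cycles(stream, front_end_width=4, loop_window_width=6)
wide = fetch_cycles(stream, front_end_width=6)

print(baseline, with_lw, wide)  # prints "45 35 30"
```

In this idealized setting the loop window recovers part of the fetch bandwidth of the wider front-end, and the share it recovers grows with the fraction of instructions that come from loops.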

The loop window cannot reach the performance of a wider processor front-end because the most important back-end structures are completely full most of the time; that is, back-end saturation limits the potential benefit of our proposal. Figure 9 shows the performance speedup for the same setups shown in Figure 8, but using an unbounded back-end: the ROB, the issue queues, and the register file are scaled to infinite size.

With an unbounded back-end, the loop window achieves higher performance speedups for SPECint and especially SPECfp benchmarks. Furthermore, the loop window using a 4-wide front-end outperforms the 6-wide front-end in several benchmarks: 176.gcc (SPECint), 172.mgrid, 178.galgel, and 179.art. Those benchmarks contain many simple dynamic loops, enabling the loop window to fetch instructions at a faster rate than the normal front-end most of the time.

5 Related Work<br />

To exploit instruction-level parallelism, it is essential to have a large window of candidate instructions available to issue. Reusing loop instructions is a well-known technique in this field, since the temporal locality present in loops provides a good opportunity for loop caching. Loop buffers were developed in the sixties for the CDC 6600/6700 series [9] to minimize the time wasted due to conditional
