Chapter 26

EXPLORING HIGH BANDWIDTH PIPELINED CACHE ARCHITECTURE FOR SCALED TECHNOLOGY

Amit Agarwal, Kaushik Roy and T. N. Vijaykumar

Purdue University, West Lafayette, IN 47906, USA

Abstract. In this article we propose a design technique to pipeline cache memories for high-bandwidth applications. With the scaling of technology, cache access latencies span multiple clock cycles. The proposed pipelined cache architecture can be accessed every clock cycle and thereby enhances bandwidth and overall processor performance. The proposed architecture uses banking to reduce bit-line and word-line delay, making the word-line to sense-amplifier delay fit into a single clock cycle. Experimental results show that optimal banking allows the cache to be split into multiple stages whose delays are equal to the clock cycle time. The proposed design is fully scalable and can be applied to future technology generations. Power, delay and area estimates show that, on average, the proposed pipelined cache improves MOPS (millions of operations per unit time per unit area per unit energy) by 40–50% compared to current cache architectures.
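Restated as a formula (a direct reading of the parenthetical definition above), the figure of merit is

MOPS = (millions of operations) / (time × area × energy),

so the reported 40–50% improvement is a combined gain in throughput per unit area and per unit energy, not a raw speedup.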

Key words: pipelined cache, bandwidth, simultaneous multithreading

1. INTRODUCTION

The phenomenal increase in microprocessor performance places significant demands on the memory system. Computer architects are now exploring thread-level parallelism to exploit the continuing improvements in CMOS technology for higher performance. Simultaneous Multithreading (SMT) [7] has been proposed to improve system throughput by overlapping multiple (either multi-programmed or explicitly parallel) threads in a wide-issue processor. SMT enables several threads to be executed simultaneously on a single processor, placing a substantial bandwidth demand on the cache hierarchy, especially on the L1 caches. SMT’s performance is limited by the rate at which the L1 caches can supply data.

One way to build a high-bandwidth cache is to reduce the cache access time, which is determined by cache size and set-associativity [3]. To reduce the access time, the cache needs to be smaller and less set-associative. However, decreasing the size or lowering the associativity has a negative impact on the cache miss rate. The miss rate determines the amount of time the CPU is stalled waiting for data to be fetched from main memory.
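To see the trade-off concretely, consider the standard average memory access time relation, AMAT = hit time + miss rate × miss penalty (the numbers below are illustrative, not measurements from this work). With a 1-cycle hit time, a 5% miss rate and a 20-cycle miss penalty, AMAT = 1 + 0.05 × 20 = 2 cycles; if shrinking the cache keeps the hit time at 1 cycle but raises the miss rate to 8%, AMAT becomes 1 + 0.08 × 20 = 2.6 cycles. Any access-time gain must therefore be weighed against the extra stall cycles caused by the higher miss rate.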


A. Jerraya et al. (eds.), Embedded Software for SOC, 345–358, 2003.

© 2003 Kluwer Academic Publishers. Printed in the Netherlands.
