
Chapter 29

HARDWARE/SOFTWARE TECHNIQUES FOR IMPROVING CACHE PERFORMANCE IN EMBEDDED SYSTEMS

Gokhan Memik, Mahmut T. Kandemir, Alok Choudhary and Ismail Kadayif

1 Department of Electrical Engineering, UCLA; 2 Department of Computer Science and Engineering, Penn State; 3 Department of Electrical and Computer Engineering, Northwestern University; 4 Department of Computer Science and Engineering, Penn State

Abstract. The widening gap between processor and memory speeds renders data locality optimization a very important issue in data-intensive embedded applications. Throughout the years, hardware designers and compiler writers have focused on optimizing data cache locality using intelligent cache management mechanisms and program-level transformations, respectively. Until now, there has not been significant research investigating the interaction between these optimizations. In this work, we investigate this interaction and propose a selective hardware/compiler strategy to optimize cache locality for integer, numerical (array-intensive), and mixed codes. In our framework, the role of the compiler is to identify program regions that can be optimized at compile time using loop and data transformations and to mark (at compile time) the unoptimizable regions with special instructions that activate/deactivate a hardware optimization mechanism selectively at run-time. Our results show that our technique can improve program performance by as much as 60% with respect to the base configuration and 17% with respect to the non-selective hardware/compiler approach.

Key words: cache optimizations, cache bypassing, data layout transformations
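To make the selective strategy concrete, the following sketch (in C) illustrates the kind of region marking described in the abstract: a regular, compiler-analyzable loop is left to loop and data transformations, while an irregular pointer-chasing region is bracketed by instructions that switch the hardware cache optimization mechanism on and off at run-time. The names hw_cache_opt_enable/hw_cache_opt_disable and their no-op stubs are invented for exposition; the chapter does not define a concrete interface at this point.

#include <stddef.h>

/* In the real framework these would be special machine-level instructions
 * emitted by the compiler; here they are no-op stubs so the sketch compiles
 * stand-alone. (Hypothetical names, not from the chapter.) */
static inline void hw_cache_opt_enable(void)  { /* activate hardware mechanism   */ }
static inline void hw_cache_opt_disable(void) { /* deactivate hardware mechanism */ }

typedef struct node { int key; struct node *next; } node_t;

/* Regular, array-intensive region: locality is handled at compile time by
 * loop/data transformations, so the hardware mechanism stays off. */
static long sum_array(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Irregular, pointer-chasing region: hard to analyze at compile time, so the
 * compiler brackets it with the activation markers. */
static long sum_list(const node_t *head)
{
    hw_cache_opt_enable();                     /* turn hardware optimization on  */
    long s = 0;
    for (const node_t *p = head; p != NULL; p = p->next)
        s += p->key;
    hw_cache_opt_disable();                    /* turn it off again               */
    return s;
}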

1. INTRODUCTION AND MOTIVATION

To improve the performance of data caches, several hardware and software techniques have been proposed. Hardware approaches try to anticipate future accesses by the processor and keep the data close to the processor. Software techniques such as compiler optimizations [6] attempt to reorder data access patterns (e.g., using loop transformations such as tiling) so that data reuse is maximized to enhance locality. Each approach has its strengths and works well for the patterns it is designed for.
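As a reminder of what such a program-level transformation looks like, the sketch below shows a standard tiled (blocked) matrix multiplication in C. The problem size N and tile size T are illustrative values, not parameters taken from this chapter; in practice T would be chosen so that a tile's working set fits in the data cache.

#include <stddef.h>

#define N 512   /* problem size (illustrative) */
#define T 32    /* tile size; chosen so a T x T block fits in the data cache (illustrative) */

/* Tiled matrix multiplication, C assumed zero-initialized by the caller.
 * Iterating over T x T blocks keeps the working sets of B and C small enough
 * to be reused from the cache before eviction, which is exactly the kind of
 * data-reuse improvement loop tiling targets. */
void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (size_t ii = 0; ii < N; ii += T)
        for (size_t kk = 0; kk < N; kk += T)
            for (size_t jj = 0; jj < N; jj += T)
                for (size_t i = ii; i < ii + T && i < N; i++)
                    for (size_t k = kk; k < kk + T && k < N; k++) {
                        double a = A[i][k];            /* reused across the j loop */
                        for (size_t j = jj; j < jj + T && j < N; j++)
                            C[i][j] += a * B[k][j];
                    }
}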

So far, each of these approaches has primarily existed independently of one another. For example, a compiler-based loop restructuring scheme may not really consider the existence of a victim cache or its interaction with the transformations performed. Similarly, a locality-enhancing hardware technique does not normally consider what software optimizations have already been incorporated into the code. Note that the hardware techniques see the addresses generated by the processor,


A. Jerraya et al. (eds.), Embedded Software for SOC, 387–401, 2003.
© 2003 Kluwer Academic Publishers. Printed in the Netherlands.
