Embedded Software for SoC - Grupo de Mecatrônica EESC/USP


Dynamic Parallelization of Array 313

apply. If two successive training periods produce the same number (as the best number of processors to use), we can optimistically assume that the loop access pattern has stabilized (and that the best number of processors will not change anymore) and execute the remaining loop iterations using that number. Since a straightforward application of conservative training can incur significant energy and performance overhead, this optimization should be applied with care.
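The stopping rule described above can be sketched as follows. This is an illustrative sketch only, not the authors' implementation; `pick_processor_count` and the training-result array are hypothetical names.

```c
#include <assert.h>

/* Optimistic stopping rule (sketch): each training period reports a
 * "best" processor count; once two successive periods agree, assume
 * the access pattern has stabilized and commit to that count for the
 * remaining iterations. */
static int pick_processor_count(const int *trained_best, int periods)
{
    for (int i = 1; i < periods; i++) {
        /* Two successive training periods agree: stop training early. */
        if (trained_best[i] == trained_best[i - 1])
            return trained_best[i];
    }
    /* Pattern never stabilized: fall back to the last measurement. */
    return trained_best[periods - 1];
}
```

The early return is what saves training overhead: any periods after the first agreeing pair are never run.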

3.6. Exploiting history information

In many array-intensive applications from the embedded image/video processing domain, a given nest is visited multiple times. Consider an example scenario where L different nests are accessed within an outermost loop (e.g., a timing loop that runs for a fixed number of iterations and/or until a condition is satisfied). In most of these cases, the best processor count determined for a given nest in one visit is still valid in subsequent visits; that is, we may not need to run the training period again in those visits. In other words, by exploiting past history information, we can eliminate most of the overhead of running training periods.
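A minimal way to realize this history mechanism is a per-nest lookup table: train on the first visit, reuse the cached result afterwards. The sketch below is hypothetical (nest identifiers, table size, and the dummy training function are all illustrative assumptions, not part of the original system).

```c
/* History table (sketch): one cached "best processor count" per loop
 * nest, indexed by a compiler-assigned nest id. First visit trains;
 * later visits of the same nest reuse the cached result. */
#define MAX_NESTS 64

static int best_count[MAX_NESTS];  /* 0 = not trained yet */
static int trainings_run = 0;      /* counts how often training ran */

static int run_training(int nest_id)   /* stand-in for real training */
{
    trainings_run++;
    return (nest_id % 4) + 1;          /* dummy "best" result */
}

static int processors_for_nest(int nest_id)
{
    if (best_count[nest_id] == 0)              /* first visit: train */
        best_count[nest_id] = run_training(nest_id);
    return best_count[nest_id];                /* later visits: reuse */
}
```

With L nests inside a timing loop of T iterations, training runs at most L times instead of L*T times.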

4. EXPERIMENTS

4.1. Benchmarks and simulation platform

To evaluate our runtime parallelization strategy, we performed experiments using a custom experimental platform with two major components: an optimizing compiler and a cycle-accurate simulator. The compiler takes a sequential program, compilation constraints, and an objective function, and generates a transformed program with explicit parallel loops. The simulator takes this transformed code as input and simulates its parallel execution. For each processor, four components are simulated: the processor core, the instruction cache, the data cache, and the shared memory. When a processor is not used in executing a loop nest, we shut off that processor and its data and instruction caches to save leakage energy. To obtain the dynamic energy consumption in a processor, we used SimplePower, a cycle-accurate energy simulator [10]. SimplePower simulates a simple five-stage pipelined architecture and captures the switching activity on a cycle-by-cycle basis; its accuracy has been validated to be within about 9% of a commercial embedded processor. To obtain the dynamic energy consumption in the instruction cache, data cache, and shared memory, we used the CACTI framework [8]. We assumed that the leakage energy per cycle of an entire cache is equal to the dynamic energy consumed per access to that cache. This assumption tries to capture the anticipated importance of leakage energy in the future, as leakage becomes the dominant part of energy consumption for 0.10 micron (and below) technologies.
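The leakage assumption stated above can be made concrete with a small accounting sketch. The function and the sample numbers below are illustrative placeholders (not SimplePower or CACTI outputs); only the modeling assumption itself comes from the text.

```c
/* Energy model sketch: total cache energy = dynamic + leakage, under
 * the paper's assumption that leakage energy per active cycle equals
 * the dynamic energy per access of the same cache. */
static double cache_energy(double e_dyn_per_access,
                           long accesses,
                           long active_cycles)
{
    double dynamic = e_dyn_per_access * (double)accesses;
    /* The assumption: leakage per cycle == dynamic energy per access. */
    double leakage = e_dyn_per_access * (double)active_cycles;
    return dynamic + leakage;
}
```

Because leakage scales with active cycles rather than accesses, shutting off an unused processor's caches (zero active cycles) removes their leakage term entirely, which is exactly why the platform powers them down.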
