29.01.2015 Views

Embedded Software for SoC - Grupo de Mecatrônica EESC/USP


Chapter 17

vidual serve for the generation of the conditions of the splitting-if and lead to the minimization of if-statement executions.
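To make the idea of a splitting-if concrete, the following minimal sketch compares a loop nest before and after splitting and counts how often an if-statement is executed. The loop bounds, the condition `x >= 10 or y >= 40` and the function names are assumptions chosen for illustration; they are not taken from the CAV, ME or QSDPCM benchmarks.

```python
# Hypothetical loop bounds; not taken from any of the benchmarks.
X_MAX, Y_MAX = 36, 49


def original(counter):
    """Loop nest with an if-statement evaluated in every iteration."""
    total = 0
    for x in range(X_MAX):
        for y in range(Y_MAX):
            counter["ifs"] += 1
            if x >= 10 or y >= 40:
                total += 1  # then-part
            else:
                total += 2  # else-part
    return total


def split(counter):
    """Same computation after splitting the loop nest on x >= 10."""
    total = 0
    for x in range(X_MAX):
        counter["ifs"] += 1  # the splitting-if, executed once per x
        if x >= 10:
            # The original condition is known to hold for every y,
            # so the inner loop needs no if-statement at all.
            for y in range(Y_MAX):
                total += 1
        else:
            # Fallback: the original inner loop, condition unchanged.
            for y in range(Y_MAX):
                counter["ifs"] += 1
                if x >= 10 or y >= 40:
                    total += 1
                else:
                    total += 2
    return total


before, after = {"ifs": 0}, {"ifs": 0}
assert original(before) == split(after)  # identical results
```

With these assumed bounds, the number of executed if-statements drops from 1764 in `before["ifs"]` to 526 in `after["ifs"]`, at the cost of duplicating the inner loop.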

4. BENCHMARKING RESULTS

All techniques presented here are fully implemented using the SUIF [24], Polylib [23] and PGAPack [14] libraries. Both GAs use the default parameters provided by [14] (population size 100, replacement fraction 50%, 1,000 iterations). Our tool was applied to three multimedia programs. First, a cavity detector for medical imaging (CAV [3]) that has passed through the DTSE methodology [9] is used. We apply loop nest splitting to this transformed application to show that we are able to remove the overhead introduced by DTSE without undoing the effects of DTSE. The other benchmarks are the MPEG4 full search motion estimation (ME [8], see Figure 17-1) and the QSDPCM algorithm [22] for scene adaptive coding.
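As a rough picture of what the quoted GA parameters mean, the sketch below runs a generic genetic algorithm with a population of 100, a replacement fraction of 50% and 1,000 iterations. It is a plain Python stand-in and does not use PGAPack. The genome length, the placeholder fitness function and the selection, crossover and mutation operators are all assumptions made for illustration.

```python
import random

POPULATION_SIZE = 100       # from the text
REPLACEMENT_FRACTION = 0.5  # from the text
ITERATIONS = 1000           # from the text
GENOME_BITS = 32            # assumption: arbitrary genome length


def fitness(genome):
    # Placeholder objective (number of set bits). The actual tool
    # scores candidate splitting conditions instead.
    return sum(genome)


def evolve(seed=0):
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(GENOME_BITS)]
        for _ in range(POPULATION_SIZE)
    ]
    n_replace = int(POPULATION_SIZE * REPLACEMENT_FRACTION)
    for _ in range(ITERATIONS):
        # Keep the fitter half, replace the rest with new children.
        population.sort(key=fitness, reverse=True)
        survivors = population[:POPULATION_SIZE - n_replace]
        children = []
        while len(children) < n_replace:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, GENOME_BITS)  # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(GENOME_BITS)] ^= 1  # single-bit mutation
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
```

Because the fitter half always survives, the best fitness never decreases from one iteration to the next in this sketch.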

Since all polyhedral operations [23] have exponential worst-case complexity, loop nest splitting is also exponential overall. Yet the effective runtimes of our tool are very low: between 0.41 (QSDPCM) and 1.58 (CAV) CPU seconds are required for optimization on an AMD Athlon (1.3 GHz).

To obtain the results presented in the following, the benchmarks are compiled and executed before and after loop nest splitting. Compilers are always invoked with all optimizations enabled so that highly optimized code is generated.

Section 4.1 illustrates the impact of loop nest splitting on CPU pipeline and cache behavior. Section 4.2 shows how the runtimes of the benchmarks are affected by loop nest splitting for ten different processors. Section 4.3 shows to what extent code sizes increase and demonstrates that our techniques are able to reduce the energy consumption of the benchmarks considerably.

4.1. Pipeline and Cache Behavior

Figure 17-4 shows the effects of loop nest splitting observed on an Intel Pentium III, Sun UltraSPARC III and a MIPS R10000 processor. All CPUs have a single main memory but separate level 1 instruction and data caches. The off-chip L2 cache is a unified cache for both data and instructions in all cases.

To obtain these results, the benchmarks were compiled and executed on the processors while monitoring the performance-measuring counters available in the CPU hardware. This way, reliable values are generated without resorting to error-prone cache simulators. The figure shows the performance values for the optimized benchmarks as a percentage of the unoptimized versions, which are denoted as 100%.
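The normalization used in the figure can be sketched as follows: each counter value of the optimized benchmark is divided by the value of the unoptimized version, which serves as the 100% baseline. The function name, the counter names and the numbers are made up for illustration; they are not measurements from this chapter.

```python
def relative_percent(optimized, unoptimized):
    """Express each optimized counter value as a percentage of its baseline."""
    result = {}
    for name, base in unoptimized.items():
        if base == 0:
            raise ValueError(f"baseline for {name!r} is zero")
        result[name] = 100.0 * optimized[name] / base
    return result


# Made-up counter readings, for illustration only.
before = {"branch_taken": 2000, "pipe_stalls": 500}
after = {"branch_taken": 500, "pipe_stalls": 400}

percentages = relative_percent(after, before)
```

A value below 100 means the optimized version triggered the event less often than the unoptimized one. Here `percentages` holds 25.0 for `branch_taken` and 80.0 for `pipe_stalls`.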

The columns Branch Taken and Pipe stalls show that we are able to generate a more regular control flow for all benchmarks. The number of taken
