29.01.2015 Views

Embedded Software for SoC - Grupo de Mecatrônica EESC/USP


HW/SW Techniques for Improving Cache Performance

4.4. Software development

For all the simulations performed in this study, we used two versions of the software: the base code and the optimized code. The base code was used to simulate the pure hardware approach. To obtain it, the benchmark codes were transformed using an optimizing compiler that performs aggressive data locality optimizations, at the highest level of optimization (-O3). The code for the pure hardware approach is generated by turning off the data locality (loop nest) optimizations using a compiler flag.

To obtain the optimized code, we first applied the data layout transformation explained in Section 2.1. The resulting code was then transformed using the compiler, which performs several locality-oriented optimizations, including tiling and loop-level transformations. The output of the compiler (the transformed code) was simulated using SimpleScalar. It should be emphasized that the pure software approach, the combined approach, and the selective approach all use the same optimized code. The only addition for the selective approach was the (ON/OFF) instructions that turn the hardware on and off. To add these instructions, we first applied the algorithm explained in Section 2 to mark the locations where the (ON/OFF) instructions were to be inserted. Then, the data layout algorithm was applied, and the resulting code was transformed using the compiler. Finally, the output code of the compiler was fed into SimpleScalar, where the instructions were actually inserted in the assembly code.
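
As an illustration of the kind of locality-oriented loop transformation mentioned above, the following is a minimal sketch of tiling applied to a matrix multiplication. It is not the output of the compiler used in the study; the function names, the flat row-major array layout, and the tile-size parameter are all assumptions made for this example.

```c
#include <stddef.h>

/* Baseline (untiled) multiply: c = a * b for n x n row-major matrices.
 * Function name and data layout are illustrative assumptions. */
void matmul_base(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Tiled multiply: the iteration space is split into tile x tile blocks so
 * that each block of a, b and c is reused while it is still cache-resident.
 * The tile size is a hypothetical tuning parameter, not a value from the
 * study; it must be at least 1. */
void matmul_tiled(size_t n, size_t tile, const double *a, const double *b,
                  double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += tile) {
        size_t i_end = (ii + tile < n) ? ii + tile : n;
        for (size_t kk = 0; kk < n; kk += tile) {
            size_t k_end = (kk + tile < n) ? kk + tile : n;
            for (size_t jj = 0; jj < n; jj += tile) {
                size_t j_end = (jj + tile < n) ? jj + tile : n;
                /* Work on one block; bounds are clamped so that tile
                 * sizes that do not divide n are handled correctly. */
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
        }
    }
}
```

Both functions compute the same product; only the order in which the iterations are visited differs. In practice, the tile size would be chosen so that the working set of one block fits in the data cache of the simulated configuration.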

5. PERFORMANCE RESULTS

All results reported in this section are obtained using the cache locality optimizer in [8, 9] as our hardware optimization mechanism. Figure 29-3 shows the improvement in terms of execution cycles for all the benchmarks
