29.01.2015 Views

Embedded Software for SoC - Grupo de Mecatrônica EESC/USP


Control Flow Driven Splitting of Loop Nests

branch instructions is reduced by between 8.1% (CAV Pentium) and 88.3% (ME Sun), leading to similar reductions in pipeline stalls (10.4%–73.1%). For the MIPS, a reduction in executed branch instructions of between 66.3% (QSDPCM) and 91.8% (CAV) was observed. The very high gains for the Sun CPU are due to its complex 14-stage pipeline, which is very sensitive to stalls.
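
The reduction in executed branches can be illustrated with a minimal C sketch. It is not taken from the benchmarks: the loop bounds `N` and `M`, the threshold `SPLIT`, the condition, and the function names `sum_original` and `sum_split` are all hypothetical, and the splitting-if is placed directly in the outer loop for simplicity. Both functions count how often an if-condition is evaluated, so the two variants can be compared.

```c
/* Hypothetical loop bounds and split point, chosen for illustration only. */
#define N 64
#define M 16
#define SPLIT 48   /* assumed: the condition is always true for x >= SPLIT */

/* Original nest: the condition is evaluated in every inner iteration. */
long sum_original(long *cond_evals)
{
    long sum = 0;
    for (int x = 0; x < N; x++) {
        for (int y = 0; y < M; y++) {
            (*cond_evals)++;
            if (x >= SPLIT || y >= 12)
                sum += x + y;
            else
                sum += x * y;
        }
    }
    return sum;
}

/* Split nest: one splitting-if per outer iteration selects an inner loop
 * without any condition when x >= SPLIT; otherwise the (simplified)
 * condition is still tested in the inner loop. */
long sum_split(long *cond_evals)
{
    long sum = 0;
    for (int x = 0; x < N; x++) {
        (*cond_evals)++;                 /* the splitting-if itself */
        if (x >= SPLIT) {
            for (int y = 0; y < M; y++)
                sum += x + y;            /* no if-statement in this copy */
        } else {
            for (int y = 0; y < M; y++) {
                (*cond_evals)++;
                if (y >= 12)
                    sum += x + y;
                else
                    sum += x * y;
            }
        }
    }
    return sum;
}
```

Both functions return the same sum, but the split version evaluates fewer conditions because the iterations with `x >= SPLIT` run without any inner if-statement. How large the saving is in practice depends on the iteration space and the conditions of the real code.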

The results clearly show that L1 I-cache performance is improved significantly. I-fetches are reduced by between 26.7% (QSDPCM Pentium) and 82.7% (ME Sun), and I-cache misses decrease substantially for the Pentium and MIPS (14.7%–68.5%). Almost no change was observed for the Sun. Because index variables are accessed less often, the L1 D-caches also benefit. D-cache fetches are reduced by between 1.7% (ME Sun) and 85.4% (ME Pentium); only for QSDPCM do D-fetches increase, by 3.9%, due to spill code insertion. D-cache misses drop by between 2.9% (ME Sun) and 51.4% (CAV Sun). The very large register file of the Sun CPU (160 integer registers) explains why the L1 D-cache behavior improves only slightly for ME and QSDPCM: since these programs use very few local variables, those variables can be held entirely in registers even before loop nest splitting.

Furthermore, the columns L2 Fetch and L2 Miss show that the unified L2 caches also benefit significantly, since reductions in accesses (0.2%–53.8%) and misses (1.1%–86.9%) are reported in most cases.

4.2. Execution Times

All in all, the factors mentioned above lead to speed-ups of between 17.5% (CAV Pentium) and 75.8% (ME Sun) for the processors considered in the previous section (see Figure 17-5). To demonstrate that these improvements are not limited to these CPUs, additional runtime measurements were performed for an HP-9000, PowerPC G3, DEC Alpha, TriMedia TM-1000, TI C6x and an ARM7TDMI, the latter in both 16-bit Thumb and 32-bit ARM mode.

Figure 17-5 shows that all CPUs benefit from loop nest splitting. CAV is sped up by between 7.7% (TI) and 35.7% (HP), with a mean improvement of 23.6%. Since loop nest splitting generates very regular control flow for ME, huge gains
