29.01.2015 Views

Embedded Software for SoC - Grupo de Mecatrônica EESC/USP

Embedded Software for SoC - Grupo de Mecatrônica EESC/USP

Embedded Software for SoC - Grupo de Mecatrônica EESC/USP

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Compiler-Directed ILP Extraction 255<br />

Certain special purpose architectures, like transport triggered architectures,<br />

as proposed in [1], are primarily programmed by scheduling data transports,<br />

rather than the CDFG’s operations themselves. Co<strong>de</strong> generation <strong>for</strong> such<br />

architectures is fundamentally different, and har<strong>de</strong>r than co<strong>de</strong> generation <strong>for</strong><br />

the standard VLIW/EPIC processors assumed in this paper, see [2].<br />

6. EXPERIMENTAL RESULTS<br />

Compiler algorithms reported in the literature either jointly address speculation<br />

and modulo scheduling but do not consi<strong>de</strong>r clustered machines, or<br />

consi<strong>de</strong>r modulo scheduling on clustered machines but cannot handle conditionals<br />

i.e., can only address straight co<strong>de</strong>, with no notion of control speculation<br />

(see Section 5). Thus, the novelty of our approach makes it difficult to<br />

experimentally validate our results in comparison to previous work.<br />

We have however <strong>de</strong>vised an experiment that allowed us to assess two of<br />

the key contributions of this paper, namely, the proposed load-balancing-driven<br />

incremental control speculation and the phasing <strong>for</strong> the complete optimization<br />

problem. Specifically, we compared the co<strong>de</strong> generated by our algorithm<br />

to co<strong>de</strong> generated by a baseline algorithm that binds state of the art predicated<br />

co<strong>de</strong> (with predicate promotion only) [6] to a clustered machine and then<br />

modulo schedules the co<strong>de</strong>. In or<strong>de</strong>r to ensure fairness, the baseline algorithm<br />

uses the same binding and modulo scheduling algorithms implemented in our<br />

framework.<br />

We present experimental results <strong>for</strong> a representative set of critical loop<br />

kernels extracted from the MediaBench [5] suite of programs and from TI’s<br />

benchmark suite [17] (see Table 19-1). The frequency of execution of the<br />

Table 19-1. Kernel characteristics.<br />

Kernel<br />

Lsqsolve<br />

Csc<br />

Shortterm<br />

Store<br />

Ford<br />

Jquant<br />

Huffman<br />

Add<br />

Pixel1<br />

Pixel2<br />

Intra<br />

Nonintra<br />

Vdthresh8<br />

Collision<br />

Viterbi<br />

Benchmark<br />

Rasta/Lsqsolve<br />

Mpeg<br />

Gsm<br />

Mpeg2<strong>de</strong>c<br />

Rasta<br />

Jpeg<br />

Epic<br />

Gsm<br />

Mesa<br />

Mesa<br />

Mpeg2enc<br />

Mpeg2enc<br />

TI suite<br />

TI suite<br />

TI suite<br />

Main Inner Loop from Function<br />

eliminate()<br />

csc()<br />

fast_Short_term_synthesis_filtering()<br />

conv422to444()<br />

FORD1()<br />

find_nearby_colors()<br />

enco<strong>de</strong>_stream()<br />

gsm_div()<br />

gl_write_zoomed_in<strong>de</strong>x_span()<br />

gl_write_zoomed_stencil_span()<br />

iquant1_intra()<br />

iquant1 _non_intra()

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!