
3. RUNTIME PARALLELIZATION

3.1. Approach

Let $I$ be the set of iterations for a given nest that we want to parallelize. We use $I_T$ (a subset of $I$) for determining the number of processors to use in executing the iterations in the set $I - I_T$. The set $I_T$, called the training set, should be very small in size compared to the iteration set $I$. Suppose that we have $K$ different processor sizes. In this case, the training set is divided into $K$ subsets $I_{T,1}, \ldots, I_{T,K}$, each of which contains $|I_T|/K$ iterations (assuming that $K$ divides $|I_T|$ evenly).

Let $p_k$ denote the processor size used for the $k$th trial, where $1 \le k \le K$. We use $T(J, p_k)$ to express the execution time of executing a set of iterations denoted by $J$ using $p_k$ processors. Now, the original execution time of a loop with iteration set $I$ using a single processor can be expressed as $T(I, 1)$. Applying our training-based runtime strategy gives an execution time of
$$\sum_{k=1}^{K} T(I_{T,k},\, p_k) \;+\; T(I - I_T,\, p_{best}).$$

In this expression, the first component gives the training time and the second component gives the execution time of the remaining iterations with the best number of processors, $p_{best}$, selected by the training phase. Consequently,
$$\Bigl[\sum_{k=1}^{K} T(I_{T,k},\, p_k) + T(I - I_T,\, p_{best})\Bigr] - T(I,\, p_{best})$$
is the extra time incurred by our approach compared to the best execution time possible. If successful, our approach reduces this difference to a minimum.
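As a minimal sketch of how this training-based selection might be realized, the following C code uses OpenMP threads to stand in for processors. The loop body, the candidate processor sizes, the array a, and the helper run_chunk are illustrative assumptions, not part of the chapter's implementation.

```c
#include <omp.h>
#include <stdio.h>

#define N     1000000   /* total iterations in the nest (|I|)      */
#define TRAIN 8000      /* size of the training set (|I_T| << |I|) */

static double a[N];

/* Execute iterations [lo, hi) of the (illustrative) nest on `procs` processors. */
static void run_chunk(int lo, int hi, int procs)
{
    omp_set_num_threads(procs);
    #pragma omp parallel for
    for (int i = lo; i < hi; i++)
        a[i] = a[i] * 2.0 + 1.0;   /* stand-in for the real loop body */
}

int main(void)
{
    const int sizes[] = {1, 2, 4, 8};              /* the K candidate processor sizes */
    const int K = sizeof(sizes) / sizeof(sizes[0]);
    const int per_trial = TRAIN / K;               /* |I_T| / K iterations per trial  */

    int best = sizes[0];
    double best_time = 1e30;

    /* Training phase: time each candidate size on its own slice of I_T. */
    for (int k = 0; k < K; k++) {
        int lo = k * per_trial, hi = lo + per_trial;
        double t0 = omp_get_wtime();
        run_chunk(lo, hi, sizes[k]);
        double t = omp_get_wtime() - t0;
        if (t < best_time) { best_time = t; best = sizes[k]; }
    }

    /* Execute the remaining iterations I - I_T with the selected size p_best. */
    run_chunk(TRAIN, N, best);
    printf("selected %d processors\n", best);
    return 0;
}
```

In this sketch the training set occupies the first TRAIN iterations, and each candidate processor size is timed on an equal slice of it, mirroring the $|I_T|/K$ split described above.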

Our approach can also be used for objectives other than minimizing execution time. For example, we can try to minimize energy consumption or the energy-delay product. As an example, let $E(J, p)$ be the energy consumption of executing a set of iterations denoted by $J$ using $p$ processors. In this case, $E(I, 1)$ gives the energy consumption of the nest with iteration set $I$ when a single processor is used. On the other hand, applying our strategy gives an energy consumption of
$$\sum_{k=1}^{K} E(I_{T,k},\, p_k) \;+\; E(I - I_T,\, p'_{best}).$$
The second component in this expression gives the energy consumption of the remaining iterations with the best number of processors (denoted $p'_{best}$) found in the training phase. Note that, in general, $p'_{best}$ can be different from $p_{best}$.
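To make the generalization explicit, the training-phase selection can be written for an arbitrary metric; the symbol $M$ and the argmin notation below are assumptions introduced here for illustration, not notation from the original text.

```latex
% Generic training-phase selection rule (illustrative notation):
% the k-th trial measures metric M on its slice of the training set,
% and the remaining iterations I - I_T run with the winning processor size.
\[
  p'_{best} \;=\; \operatorname*{arg\,min}_{p_k,\; 1 \le k \le K} \; M\bigl(I_{T,k},\, p_k\bigr),
  \qquad M \in \{\, T \text{ (time)},\; E \text{ (energy)} \,\}.
\]
```

For the energy-delay product, $M(J, p) = E(J, p)\,T(J, p)$ can be used as the trial metric in the same way.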

Figure 23-2 illustrates this training-period-based loop parallelization strategy. It should also be noted that we are making an important assumption here: we assume that the best processor size does not change during the execution of the entire loop. Our experience with array-intensive embedded applications indicates that in the majority of cases this assumption is valid. However, there are also cases where it is not. In such cases, our approach can be slightly modified as follows. Instead of having only a single training
