Lecture Notes in Computer Science 4917

Modeling Multigrain Parallelism on Heterogeneous Multi-core Processors

The execution time of the off-loaded sequential execution for sub-phase 2 in Figure 3 can be modeled as:

$T_{offload}(w_2) = T_{APU}(w_2) + O_r + O_s$ (2)

where $T_{APU}(w_2)$ is the time needed to complete sub-phase 2 without additional overhead. We can write the total execution time of all three sub-phases as:

$T = T_{HPU}(w_1) + T_{APU}(w_2) + O_r + O_s + T_{HPU}(w_3)$ (3)

To reduce complexity, we replace $T_{HPU}(w_1) + T_{HPU}(w_3)$ with $T_{HPU}$, $T_{APU}(w_2)$ with $T_{APU}$, and $O_s + O_r$ with $O_{offload}$. We can now rewrite Equation 3 as:

$T = T_{HPU} + T_{APU} + O_{offload}$ (4)

The program model in Figure 3 is representative of one of potentially many phases in a program. We further modify Equation 4 for a program with N phases:

$T = \sum_{i=1}^{N} \left( T_{HPU,i} + T_{APU,i} + O_{offload} \right)$ (5)

3.2 Modeling Parallel Execution on APUs
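Equation 5 is a straightforward accumulation over per-phase measurements. A minimal sketch in Python follows; the function name, the phase tuples, and the overhead value are illustrative placeholders, not values or code from the paper:

```python
# Sketch of Equation 5: total time of a program with N phases, where each
# phase contributes HPU time, APU time, and a fixed off-load overhead.
# All numbers below are illustrative, not measured values.

def total_time(phases, o_offload):
    """phases: list of (t_hpu_i, t_apu_i) pairs, one per phase."""
    return sum(t_hpu + t_apu + o_offload for t_hpu, t_apu in phases)

# Example: three phases, off-load overhead of 0.5 time units per phase.
phases = [(2.0, 4.0), (1.0, 3.0), (2.5, 5.0)]
print(total_time(phases, 0.5))  # (2+4+0.5) + (1+3+0.5) + (2.5+5+0.5) = 19.0
```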

Each off-loaded part of a phase may contain fine-grain parallelism, such as task-level parallelism in nested procedures or data-level parallelism in loops. This parallelism can be exploited by using multiple APUs for the off-loaded workload. Figure 4 shows the execution time decomposition for execution using one APU and two APUs. We assume that the code off-loaded to an APU during phase i has a part which can be further parallelized across APUs, and a part executed sequentially on the APU. We denote by $T_{APU,i}(1,1)$ the execution time of the further parallelized part of the APU code during the ith phase. The first index, 1, refers to the use of one HPU thread in the execution. We denote by $T_{APU,i}(1,p)$ the execution time of the same part when p APUs are used to execute this part during the ith phase. We denote by $C_{APU,i}$ the non-parallelized part of the APU code in phase i. Therefore, we obtain:

$T_{APU,i}(1,p) = \frac{T_{APU,i}(1,1)}{p} + C_{APU,i}$ (6)
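Equation 6 divides only the parallelizable portion of the APU time by the number of APUs, leaving the sequential part untouched. A minimal sketch, with function and parameter names of our choosing:

```python
def t_apu_parallel(t_apu_11, c_apu, p):
    """Equation 6: APU time of a phase when p APUs execute the
    parallelizable part t_apu_11; the sequential part c_apu is
    unaffected by p. Illustrative helper, not code from the paper."""
    return t_apu_11 / p + c_apu

# Doubling the APUs halves only the parallel component:
print(t_apu_parallel(8.0, 2.0, 1))  # 8.0/1 + 2.0 = 10.0
print(t_apu_parallel(8.0, 2.0, 2))  # 8.0/2 + 2.0 = 6.0
```

Note that as p grows, the time converges to the sequential part $C_{APU,i}$, an Amdahl-style limit on APU-side speedup.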

Given that the HPU off-loads to APUs sequentially, there exists a latency gap between consecutive off-loads on APUs. Similarly, there exists a gap between receiving or committing return values from two consecutive off-loaded procedures on the HPU. We denote by g the larger of the two gaps. On a system with p APUs, parallel APU execution will incur an additional overhead as large as p · g. Thus, we can model the execution time in phase i as:

$T_i(1,p) = T_{HPU,i} + \frac{T_{APU,i}(1,1)}{p} + C_{APU,i} + O_{offload} + p \cdot g$ (7)
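Putting the terms together, Equation 7 can be sketched as a single function; the $p \cdot g$ term captures the serialization gap of consecutive off-loads. All input values below are illustrative assumptions, not measurements from the paper:

```python
def phase_time(t_hpu, t_apu_11, c_apu, o_offload, p, g):
    """Equation 7: time of phase i with one HPU thread and p APUs,
    including the off-load overhead and the latency-gap penalty p*g.
    Illustrative sketch; names are ours, not the paper's."""
    return t_hpu + t_apu_11 / p + c_apu + o_offload + p * g

# With a non-zero gap g, adding APUs eventually stops paying off:
for p in (1, 2, 4, 8):
    print(p, phase_time(t_hpu=1.0, t_apu_11=8.0, c_apu=2.0,
                        o_offload=0.5, p=p, g=0.3))
```

With these illustrative numbers the predicted phase time reaches its minimum around p = 4 and rises again at p = 8, since the $p \cdot g$ gap penalty grows linearly while the parallel gain shrinks; this is exactly the trade-off the model is built to expose.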
