Architecture of Computing Systems (Lecture Notes in Computer ...)

240 F. Xudong et al.

threads in the wavefront have to execute the branch, which means all the paths are executed serially. This situation degrades kernel performance greatly. Therefore, branch divergences in kernels should be eliminated as much as possible.

We convert control dependence to data dependence, which caters to the GPU's powerful data processing capability [10]. Our branch elimination is a two-step strategy: a) Branch fusion. Branch fusion is only suitable for the situation where the left-hand expressions of the if and else branches are the same. If not, there is no benefit in using branch fusion, since the expressions in both branches have to be executed anyway. b) Expression simplification. The second step simplifies the expressions obtained from branch fusion in the hope of eliminating all the redundant computations. With branch elimination, we can eliminate all eight branches in the Interp kernel.

3.4 CPU-GPU Task Distribution

GPUs are good at performing ALU-intensive tasks, which qualifies them as good accelerators for CPUs. The philosophy of using GPUs to accelerate applications is to launch massive numbers of threads to exploit inter-thread parallelism and hide memory access latencies. So when the problem size is very small, there may not be enough threads to occupy the stream processing cores and fully exploit parallelism. Take the problem size 16³ for example. Assuming that there are enough GPRs, only 4K threads are needed to process the computation at this problem size, which is much less than the maximum 10K threads that the RV770 core can provide, not to mention the smaller problem sizes.

When the speedup obtained by the GPU is less than one, we should consider moving the task back to run on the CPU. Nevertheless, porting computing tasks to the CPU entails inevitable overhead such as data communication latency. This means the performance gain from distributing the task between the CPU and the GPU must outweigh this overhead for the overall system performance to improve. Only then is distributing tasks between the CPU and the GPU sure to outperform execution on a single computing device, whether CPU or GPU.

4 Experimental Evaluation

To examine the benefits of our optimization strategies, we implemented the Mgrid application using Brook+ on an AMD Radeon HD4870 GPU. All the results are compared to the single-thread CPU version, which is measured on an Intel Xeon E5405 CPU running at 2 GHz with 256 KB L1 cache and 12 MB L2 cache. We used the Intel ifort compiler as the CPU compiler with the optimization option -O3.

Mgrid is a 3D multigrid application in the SPECfp/NAS benchmark. Notably, it is the only application found in both the SPEC and NAS benchmark suites, and among the few SPEC 2000 applications that survived from SPEC 95 and SPEC 98. The main process of Mgrid follows a V-cycle pattern performed on multilevel grids over multiple passes (iterations), as illustrated in Fig. 1(b). Mgrid
