Architecture of Computing Systems (Lecture Notes in Computer Science)

Optimizing Stencil Application on Multi-thread GPU Architecture 235

architectures provide easier programmability and increased generality, abstracting away trivial graphics-programming details, e.g., Brook+ [1] and CUDA [2]. Therefore, developers have begun to harness the tremendous processing power of GPUs for non-graphics applications. GPU computation is now widely adopted in scientific computing fields such as biomedicine, computational fluid dynamics simulation, and molecular dynamics simulation [3].

GPUs were originally developed for graphics processing, i.e., media applications, which exhibit little data reuse and emphasize real-time performance. In scientific applications, however, there can be rich data reuse, while the data access patterns may not be as uniform as in media applications. This leaves programmers much room to optimize GPU code to exploit more data reuse and hide long memory-access latencies. Therefore, besides choosing a good CPU-to-GPU application mapping, we should also optimize the GPU code according to the architecture of the GPU and the mechanisms provided by the programming language.

In this paper, we have implemented and optimized Mgrid, a multi-grid application from SPEC2000 commonly used to solve partial differential equations (PDEs), on an AMD GPU platform. At the heart of Mgrid are stencil (nearest-neighbor) computations on 27-point 3D cubes. Stencil computations feature abundant parallelism and low computational intensity, which offers great opportunity for optimizing temporal and spatial locality and makes them effective architectural evaluation benchmarks [4]. To optimize the naive GPU code, we propose four optimization strategies:

(a) Improve thread utilization. Using the vector types and multiple output streams provided by the Brook+ programming language, we exploit data reuse within each thread and parallelism among threads to achieve better thread utilization.

(b) Stream reorganization. We reorganize the 3D data stream into a 2D stream in a blocked manner to capture more data locality in the GPU cache. The stencil computations of Mgrid reference data on three consecutive planes when calculating a grid point. Through stream reorganization, we exploit the data reuse within each plane.

(c) Branch elimination. We propose branch elimination to reduce the performance loss caused by branch divergence in the Interp kernel. By changing control dependences into data dependences, all eight branches in the kernel are eliminated, improving kernel performance significantly.

(d) CPU-GPU workload distribution. To make full use of the CPU-GPU heterogeneous system, we reasonably distribute the workload between the two computing resources according to the nature of the work, the problem size, and the communication overhead.

Note that though our optimizations are developed for Mgrid, they can be applied to any stencil-like computation, making our optimization approaches general for developing GPU applications. In our work, all experiments use a double-precision floating-point implementation. The experimental results show that the optimized GPU implementation of Mgrid gains a 2.38x speedup
