01.12.2012 Views

Architecture of Computing Systems (Lecture Notes in Computer ...


Optimizing Stencil Application on Multi-thread GPU Architecture

However, using vector types requires modifying the indices in the kernel body, and adopting multiple output streams requires splitting the input streams. Both methods also increase the demand for GPRs and consequently reduce the number of active threads that can be created. Whether these overheads can be offset by the performance gain of tuning thread granularity depends on the kernel size and on specific kernel characteristics. The best thread granularity therefore has to be determined through experiments.
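The trade-off above can be illustrated with a small CPU-side emulation. This is only a sketch under stated assumptions: it is plain Python rather than GPU code, the 3-point averaging stencil is a stand-in for a real kernel, and the names `stencil_point`, `run_kernel`, and `outputs_per_thread` are hypothetical. It shows that a coarser thread writes several outputs, needs a rewritten base index, and reduces the number of threads in the launch while producing the same result.

```python
def stencil_point(src, i):
    """3-point average with clamped boundaries (illustrative stencil only)."""
    left = src[max(i - 1, 0)]
    right = src[min(i + 1, len(src) - 1)]
    return (left + src[i] + right) / 3.0


def run_kernel(src, outputs_per_thread):
    """Emulate a launch in which each logical thread writes
    `outputs_per_thread` consecutive output elements."""
    n = len(src)
    # Ceiling division: coarser threads mean fewer threads in total.
    num_threads = (n + outputs_per_thread - 1) // outputs_per_thread
    dst = [0.0] * n
    for tid in range(num_threads):
        # The index rewrite that a coarser granularity forces on the kernel body.
        base = tid * outputs_per_thread
        for lane in range(outputs_per_thread):
            i = base + lane
            if i < n:  # guard the last, possibly partial, thread
                dst[i] = stencil_point(src, i)
    return dst, num_threads


data = [float(x * x) for x in range(16)]
fine, fine_threads = run_kernel(data, 1)      # one output per thread
coarse, coarse_threads = run_kernel(data, 4)  # four outputs per thread
print(fine_threads, coarse_threads, fine == coarse)
```

On real hardware the coarser variant would also hold more live values per thread, which is the register pressure mentioned above; the emulation does not model that cost.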

3.2 Stream Reorganization

Although memory access through the texture unit supports 1D, 2D, and 3D addressing modes, the texture cache is optimized for 2D locality. To exploit more data locality in the cache, the threads in the same wavefront should read texture addresses that are close along two dimensions. This may require transforming the data layout or the data structure.

Take the implementation of the Resid kernel on the GPU as an example. The runtime library automatically expands 3D data streams linearly into 2D data streams. This transformation may hurt cache performance, because the computation needs to access adjacent data in three dimensions. To get better performance, the 3D stream should instead be transformed into a 2D stream in the block manner, as illustrated in Fig. 1(a). Compared with the linearly expanded layout, data that is adjacent in the logical space is kept adjacent in the 2D stream. The Resid kernel references data on three consecutive planes when performing the stencil computation for a grid point. After stream reorganization, we exploit the cache data locality within each plane.
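The two layouts can be compared with a minimal index-mapping sketch. Everything here is an assumption made for illustration: the source does not give the runtime's exact linear expansion or the block mapping, so `linear_layout`, `block_layout`, the stream `width`, and `tiles_per_row` are hypothetical. The linear variant flattens (x, y, z) to a 1D offset and wraps it at a fixed width; the block variant places each z-plane as a tile in a 2D grid of tiles.

```python
def linear_layout(x, y, z, nx, ny, width):
    """Flatten (x, y, z) to a 1D offset, then wrap into rows of `width` texels."""
    offset = (z * ny + y) * nx + x
    return offset % width, offset // width  # (u, v)


def block_layout(x, y, z, nx, ny, tiles_per_row):
    """Place each z-plane as an nx-by-ny tile inside a 2D grid of tiles."""
    tile_col = z % tiles_per_row
    tile_row = z // tiles_per_row
    return tile_col * nx + x, tile_row * ny + y  # (u, v)


def manhattan(a, b):
    """Distance between two 2D texel coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


nx, ny, nz = 4, 4, 4
width = 8          # assumed 2D stream width that does not match nx
tiles_per_row = 2  # four planes arranged as a 2x2 grid of tiles

p, q = (1, 1, 1), (1, 2, 1)  # neighbours along y inside one plane
linear_dist = manhattan(linear_layout(*p, nx, ny, width),
                        linear_layout(*q, nx, ny, width))
block_dist = manhattan(block_layout(*p, nx, ny, tiles_per_row),
                       block_layout(*q, nx, ny, tiles_per_row))
print(linear_dist, block_dist)
```

With these assumed sizes, the in-plane neighbours stay one texel apart in the block layout but are scattered by the linear wrap. How large the gap is in practice depends on the width the runtime actually chooses.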

[Figure 1 labels recovered from the page: (a) blocks numbered 1, 2, 3, 4; (b) kernels resid, psinv, rprj3, interp, norm3 at Level 1, Level 2, Level N-1, Level N, between Finest Grid and Coarsest Grid, with GPU and CPU regions.]

Fig. 1. (a) Transforming the 3D stream into a 2D stream in the block manner [9]. (b) V-cycle pattern of Mgrid.

3.3 Branch Elimination

GPUs adopt a SIMD execution mode, which incurs a large flow-control overhead. Take branching as an example: AMD GPUs combine all the necessary paths as a wavefront. However, even if only one thread within a wavefront diverges, the rest of the
