
Mgrid spends 85% of its execution time performing stencil computations on 3D arrays, which pass through all four of its primary kernels: Resid, Psinv, Rprj3, and Interp. The Resid kernel computes the residual. The Psinv kernel computes the approximate inverse. The Rprj3 kernel computes the projection from the fine grid to the coarse grid. The Interp kernel computes the interpolation from the coarse grid to the fine grid.
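
To make the access pattern concrete, below is a minimal CUDA sketch of a residual kernel in the spirit of Resid. It is illustrative only: the real Mgrid Resid applies a 27-point stencil and the paper's implementation targets a stream programming model, so the 7-point form, the IDX helper, and the coefficients c0/c1 here are our own simplifications.

```cuda
// Illustrative sketch, not the paper's code: residual r = v - A*u on a
// 3D grid of edge length n, using a 7-point stencil (Mgrid's Resid uses
// a 27-point stencil). c0/c1 are placeholder coefficients.
#define IDX(i, j, k, n) ((size_t)(i) * (n) * (n) + (size_t)(j) * (n) + (k))

__global__ void resid_7pt(const double *u, const double *v, double *r,
                          int n, double c0, double c1) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int i = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || i >= n - 1 || j < 1 || j >= n - 1 || k < 1 || k >= n - 1)
        return;  // skip boundary points
    // A*u at (i,j,k): center point plus its six face neighbors
    double au = c0 * u[IDX(i, j, k, n)]
              + c1 * (u[IDX(i - 1, j, k, n)] + u[IDX(i + 1, j, k, n)]
                    + u[IDX(i, j - 1, k, n)] + u[IDX(i, j + 1, k, n)]
                    + u[IDX(i, j, k - 1, n)] + u[IDX(i, j, k + 1, n)]);
    r[IDX(i, j, k, n)] = v[IDX(i, j, k, n)] - au;  // residual
}
```

Each output point touches seven input points, so neighboring threads fetch overlapping data; the reuse strategies discussed below exist precisely to exploit that overlap.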

4.1 Effects of Improving Thread Utilization

Here we examine the effectiveness of improving thread utilization. Applying the strategies proposed in Subsection 3.1, we used four thread granularities: double, double2, double2 plus two output streams, and double2 plus four output streams. If a thread using double can compute N points, then a thread using double2, double2 plus two output streams, and double2 plus four output streams can compute 2N, 4N, and 8N points, respectively. For convenience's sake, we use N, 2N, 4N, and 8N to denote the four thread granularities.
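
The granularity ladder can be pictured with a sketch. The following CUDA fragment is a hypothetical 1D 3-point stencil, not the paper's code (its "output streams" belong to its stream programming model); it shows the N to 2N step, where one thread consumes double2 values and produces two packed points, reusing the fetched middle element for both. The 4N and 8N variants extend this by writing two or four such outputs per thread.

```cuda
// Illustrative sketch: doubling thread granularity with double2.
// One thread handles two consecutive points (one double2), so half as
// many threads are launched and the middle fetch is reused twice.
__global__ void stencil_double2(const double2 *u, double2 *r, int n2) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // one double2 = 2 points
    if (t < 1 || t >= n2 - 1) return;
    double2 left = u[t - 1], mid = u[t], right = u[t + 1];
    double2 out;
    // 3-point stencil on both packed points; mid.x/mid.y serve both outputs
    out.x = left.y - 2.0 * mid.x + mid.y;
    out.y = mid.x - 2.0 * mid.y + right.x;
    r[t] = out;
}
```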

Fig. 2(a) shows the speedup of the Resid kernel over the CPU implementation under different problem sizes using the four thread granularities. Note that since our optimization strategies target the stencil computations in Mgrid, when evaluating a single kernel we do not count the time consumed by loading the kernels onto the GPU or by the periodic communication subroutine Comm3 in each kernel. In the overall evaluation of Mgrid, however, all of this time is counted.
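
A sketch of how such a per-kernel measurement can be isolated, assuming a CUDA-style runtime (the paper's platform differs, and the function and parameter names here are ours): a warm-up launch absorbs the one-time kernel-load cost, and events bracket only the launch being measured.

```cuda
// Illustrative sketch: timing one kernel launch while excluding kernel
// load time and host-device communication from the measurement.
// resid_7pt and its placeholder coefficients refer to the sketch above.
#include <cstdio>

void time_kernel(const double *u, const double *v, double *r, int n) {
    dim3 block(8, 8, 8), grid((n + 7) / 8, (n + 7) / 8, (n + 7) / 8);
    resid_7pt<<<grid, block>>>(u, v, r, n, -6.0, 1.0);  // warm-up: absorbs load cost
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    resid_7pt<<<grid, block>>>(u, v, r, n, -6.0, 1.0);  // measured launch only
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```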

As shown in Fig. 2(a), the thread granularity 8N demonstrates the best performance under problem sizes 256³ and 128³, while 4N yields the best performance under the other two problem sizes. Problem sizes smaller than 32³ are not shown in the figure since their speedups are less than one. Under the large problem sizes 256³ and 128³, the speedup scales up with the granularity. This is because under a large problem size, a large thread granularity provides more opportunities to exploit intermediate-result reuse within threads, while there are still enough threads to exploit parallelism among threads. Under a small problem size, however, a large thread granularity requires more GPRs in each thread, which reduces the number of concurrent threads and limits parallelism among threads. Under problem sizes 64³ and 32³, the speedups first scale with the granularity, reach a maximum at granularity 4N, and then decrease at granularity 8N (see Fig. 2(a)): the performance gained by capturing more data reuse within each thread at the large granularity (8N) is offset by the performance lost from having too few threads to fully occupy the parallel stream computing cores.
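
A rough worked example makes the crossover plausible. Assuming, for illustration, that each thread produces g points of an s³ grid (boundary handling ignored), the schedulable thread pool is

```latex
\mathrm{threads}(s, g) = \frac{s^{3}}{g}, \qquad
\mathrm{threads}(32, 8N) = \frac{32^{3}}{8} = 4096, \qquad
\mathrm{threads}(256, 8N) = \frac{256^{3}}{8} \approx 2.1 \times 10^{6}.
```

About four thousand threads at 32³ with granularity 8N may be too few to hide memory latency across hundreds of parallel stream computing cores, while roughly two million threads at 256³ still leave abundant thread-level parallelism, which matches where the figure places the crossover.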

Fig. 2(b) shows the speedup of each kernel and of the whole application Mgrid under the largest problem size, 256³. We can see that the speedups of the kernels Resid and Psinv scale up monotonically with the thread granularity. This is because a large thread granularity is favorable for intermediate-data reuse within each thread, while there remain abundant threads for thread-level parallelism. Interp computes the fine grid by accessing the coarser grid, so it tends to be an ALU-intensive kernel. The speedup of the Interp kernel increases rapidly with the thread granularity and reaches a maximum at thread granularity 4N (see Fig. 2(b)).
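
A 1D caricature of that access pattern may help (illustrative only; the Mgrid Interp is 3D and its interpolation weights differ): each coarse-grid fetch feeds several fine-grid outputs, so the arithmetic-to-memory ratio is high, consistent with Interp being ALU-intensive.

```cuda
// Illustrative sketch: linear interpolation from a coarse grid of nc
// points onto a fine grid. Each thread reuses its two coarse fetches
// for two fine outputs; the last coarse point is handled separately.
__global__ void interp_1d(const double *coarse, double *fine, int nc) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nc - 1) return;
    double a = coarse[i], b = coarse[i + 1];
    fine[2 * i]     = a;                 // fine point coincides with coarse
    fine[2 * i + 1] = 0.5 * (a + b);     // fine point between coarse points
}
```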
